I lean mostly towards this and also the chess notations - not sure if it might get chopped during tokenization unless it's very precisely processed.
It's like designing an LLM just for predicting protein sequence because the sequencing matters. The base data might have it but I don't think that's the intention for it to continue.
I wonder if they avoid that due to the potential for negative press from the outputs of a more "raw" model.
Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which no side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, which will be completely unlike any it would have seen in training (and in which the legal move choice is often highly restricted)?
While good for testing legality of next moves, these positions are not so useful for distinguishing their quality, since usually one side already has an overwhelming advantage.
I'd expect that an AI that has seen billions of chess positions, and the moves played in them, can figure out the rules for legal moves without being told?
So much of the training data (eg common crawl, pile, Reddit) is dogshit, so it generates reheated dogshit.
Also what does a normal human do? They look at how to move one random piece, and they use a very small dictionary / set of basic rules to move it. I do not remember learning to count every piece and its options by looking them up in the rulebook. I learned to 'see' how I can move one type of chess piece.
If an LLM uses only these piece moves on a mathematical level, it would do the same thing as I do.
And yes, there is also absolutely the option for an LLM to learn some kind of meta game.
There are plenty of AIs that learn the rules of games, like AlphaZero.
LLMs might not have the architecture to 'learn', but they also might. If it optimizes all possible moves one chess piece can do (which is not that much to learn), it can easily 'move' from one game state to another using this type of dictionary.
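The "small dictionary of piece moves" idea can be made concrete. This is a minimal sketch, assuming an empty 8x8 board, so it ignores blocking pieces, captures, pins and check; `KNIGHT_OFFSETS` and `knight_moves` are made-up names for illustration.

```python
# Illustrative only: the movement rule for one piece type on an empty board.
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_moves(square: str) -> list[str]:
    """Return the squares a knight on `square` (e.g. 'g1') could jump to."""
    file = ord(square[0]) - ord("a")   # 'a'..'h' -> 0..7
    rank = int(square[1]) - 1          # '1'..'8' -> 0..7
    targets = []
    for df, dr in KNIGHT_OFFSETS:
        f, r = file + df, rank + dr
        if 0 <= f < 8 and 0 <= r < 8:  # stay on the board
            targets.append(chr(ord("a") + f) + str(r + 1))
    return sorted(targets)

print(knight_moves("g1"))
```

Eight offsets plus a bounds check cover every knight move; the hard part of real legality checking is everything this sketch leaves out.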
After spending some years raising my children I gave up the notion that humans are data efficient. It takes a mind numbing amount of training to get them to learn the most basic skills.
Does a child remotely know as much as ChatGPT? Is it able to reason remotely as well?
That would be like alien archaeologists of the future finding a chess board and some pieces in a capsule orbiting Mars after the total destruction of Earth and all recorded human thought. The archaeologists could invent their own games to play on the chess board but they’d have no way of ever knowing they were playing chess.
Interestingly, Bobby Fischer did it in the same way. Maybe AlphaZero also hates chess? :-)
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess. If this doesn’t convince you, I encourage you to write a program that can take strings like 1. e4 d5 2. exd5 Qxd5 3. Nc3 and then say if the last move was legal.
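For a sense of what that program involves, the parsing half is the easy part. This is a sketch assuming well-formed, space-separated movetext with no comments, variations or result markers; `split_movetext` is a made-up name. Deciding whether the last move is legal would still need a full board model, which is deliberately left out.

```python
import re

def split_movetext(movetext: str) -> list[str]:
    """Drop move numbers like '1.' or '3...' and return the moves in order."""
    tokens = movetext.split()
    return [t for t in tokens if not re.fullmatch(r"\d+\.+", t)]

moves = split_movetext("1. e4 d5 2. exd5 Qxd5 3. Nc3")
print(moves)      # all moves so far, White and Black alternating
print(moves[-1])  # the move whose legality would have to be checked
```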
However, I can't say if LLMs fall in the "statistical AI" category.
You can also say humans are "just XYZ biological system," but that doesn't mean they don't reason. The same goes for LLMs.
A human doesn’t use next token prediction to solve word problems.
The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.
If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.
However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).
Given coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.
LLMs may be useful but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.
You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you.
You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent.
I'm not saying that your observations aren't correct, but this is not a binary. It is entirely possible that the tasks you observe the models on are exactly the kind where they tend to regurgitate. But that doesn't mean that it is all they can do.
Ultimately, the question is whether there is a "there" there at all. Even if 9 times out of 10, the model regurgitates, but that one other time it can actually reason, that means that it is capable of reasoning in principle.
No, the human brain does not "understand" language. It just knows how to control the firing of neurons that control the vocal cords, in order to maximize an endocrine reward function that has evolved to maximize biological fitness.
I can speak about human brains the same way you speak about LLMs. I'm sure you can spot the problem in my conclusions: just because the human brain is "only" firing neurons, it does actually develop an understanding of the world. The same goes for LLMs and next-word prediction.
We see the LLM sometimes do sort of well at a whole bunch of tasks. But it makes silly mistakes that seem obvious to us. We say, “Ah ha! So it can’t reason after all”.
Say LLMs get a bit better, to the point they can beat chess grandmasters 55% of the time. This is quite good. Low level chess players rarely ever beat grandmasters, after all. But, the LLM spits out illegal moves sometimes and sometimes blunders nonsensically. So we say, “Ah ha! So it can’t reason after all”.
But what would it matter if it can reason? Beating grandmasters 55% of the time would make it among the best chess players in the world.
For now, LLMs just aren’t that good. They are too error prone and inconsistent and nonsensical. But they are also sort of weirdly capable at lots of things in strange inconsistent ways, and assuming they continue to improve, I think they will tend to defy our typical notions of human intelligence.
Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.
Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jumps straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.
This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".
E.g. I do contract work on an LLM-related project where one of the systemic changes introduced - in addition to multiple levels of quality checks - is to force people to input a given sentence word for word, followed by a word from a set of 5 or so. Only a minority of the submissions get that sentence correct including the final word, despite the system refusing to let you submit unless the initial sentence is correct. Seeing the data has been an absolutely shocking indictment of human reasoning.
These are submissions from a pool of people who have passed reasoning tests...
When I've tested the process myself as well, it takes only a handful of steps before the tendency is to "drift off" and start replacing a word here and there and fail to complete even the initial sentence without a correction. I shudder to think how bad the results would be if there wasn't that "jolt" to try to get people back to paying attention.
Keeping humans consistently carrying out a learned process is incredibly hard.
I think this is circular?
If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercising those things?
Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place?
I'm not saying that LLMs have those capabilities; I'm questioning whether there is any utility in distinguishing the "actual" capability from identical outputs.
I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument.
LLMs must be prompted for everything and don’t act on their own.
The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be.
It’s dangerous to put so much faith in what in reality is a very good guessing machine. You can ask it to retrace its steps, but it’s just guessing at what its steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.
Can you elaborate on the difference? Are you bringing sentience into it? It kind of sounds like it from "don't act on their own". But reasoning and sentience are wildly different things.
> It’s dangerous to put so much faith in what in reality is a very good guessing machine
Yes, exactly. That's why I think it is good we are supplementing fallible humans with fallible LLMs; we already have the processes in place to assume that not every actor is infallible.
Do we blindly trust or believe every single thing we hear from another person? Of course not. But hearing what they have to say can still be fruitful, and it is not like we have an oracle at our disposal who always speaks the absolute truth, either. We make do with what we have, and LLMs are another tool we can use.
They’ll fail in different ways than something that thinks (and doesn’t have some kind of major disease of the brain going on) and often smack in the middle of appearing to think.
Can, but don't by default. Just as LLMs can be asked for chain of thought, but the default for most users is just chat.
This behaviour of humans is why we software developers have daily standup meetings, version control, and code review.
> LLMs must be prompted for everything and don’t act on their own
And this is why we humans have task boards like JIRA, and quarterly goals set by management.
As for people communicating each step, we have plenty of experiments showing that it's pretty hard to get people to reliably report what they actually do as opposed to a rationalization of what they've actually done (e.g. split brain experiments have shown both your brain halves will happily lie about having decided to do things they haven't done if you give them reason to think they've done something)
You categorically cannot trust people's reasoning about "why" they've made a decision to reflect what actually happened in their brain to make them do something.
Throw all the math problems you want at an LLM for training; it will still fail if you step outside of the familiar.
To which I say:
ᛋᛟ᛬ᛞᛟ᛬ᚻᚢᛗᚪᚾᛋ
ᛁ᛬ᚻᚪᚹᛖ᛬ᛟᚠᛏᛖᚾ᛬ᛋᛖᛖᚾ᛬ᛁᚾ᛬ᛞᛁᛋᚲᚢᛋᛋᛁᛟᚾᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚦᛁᛋ᛬ᚲᛚᚪᛁᛗᛋ᛬ᚦᚪᛏ᛬ᚻᚢᛗᚪᚾ᛬ᛗᛁᚾᛞᛋ᛬ᚲᚪᚾ᛬ᛞᛟ᛬ᛁᛗᛈᛟᛋᛋᛁᛒᛚᛖ᛬ᚦᛁᛝᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚷᛖᚾᛖᚱᚪᛚᛚᚣ᛬ᛋᛟᛚᚹᛖ᛬ᚦᛖ᛬ᚻᚪᛚᛏᛁᛝ᛬ᛈᚱᛟᛒᛚᛖᛗ
edit: Snap, you said the same in your other comment :)
It seems to me that the idea of the Universal Turing Machine is quite misleading for a lot of people, such as David Deutsch.
My impression is that the amount of compute to solve most problems that can really only be solved by Turing Machines is always going to remain inaccessible (unless they're trivially small).
But at the same time, the universe seems to obey a principle of locality (as long as we only consider the Quantum Wave Function, and don't postulate that it collapses).
Also, the quantum fields are subject to some simple (relative to LLMs) geometric symmetries, such as invariance under the U(1)xSU(2)xSU(3) group.
As it turns out, similar group symmetries can be found in all sorts of places in the real world.
Also it seems to me that at some level, both ANN's and biological brains set up a similar system to this physical reality, which may explain why brains develop this way and why both kinds are so good at simulating at least some aspects of the physical world, such as translation, rotation, some types of deformation, gravity, sound, light etc.
And when biological brains that initially developed to predict the physical world are then used to create language, that language is bound to use the same type of machinery. And this may be why LLM's do language so well with a similar architecture.
The point of UTM's is not to ever use them, but that they're a shortcut to demonstrating Turing completeness because of their simplicity. Once you've proven Turing completeness, you've proven that your system can compute all Turing computable functions and simulate any other Turing complete system, and we don't know of any computable functions outside this set.
My point is that any such system is extremely limited due to how slow they become at scale (when running algorithms/programs that require full turing completeness), due to its "single threaded" nature. Such algorithms simply are not very parallelizable.
This means a Turing Complete system becomes nearly useless for things like AI. The same is the case inside a human brain, where signals can only travel at around the speed of sound.
Tensor / neuron based systems sacrifice Turing Completeness to gain (many) orders of magnitude more compute speed.
I know that GPU's CAN in principle emulate a Turing Complete system, but they're even slower at it than CPU's, so that's irrelevant. The same goes for human brains.
People like Deutsch are so in love with the theoretical universality of Turing Completeness that they seem to ignore that a Turing Complete system might take longer to formulate meaningful thought than the lifetime of a human. And possibly longer than the lifetime of the Sun, for complex ideas.
The fact that so much can be done by systems that are NOT Turing Complete may seem strange. But I argue that since the laws of Physics are local (with laws described by tensors), it should not be such a surprise that computers that perform tensor computations are pretty good at simulating physical reality.
My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa.
> I know that GPU's CAN in principle emulate a Turing Complete system, but they're even slower at it than CPU's, so that's irrelevant. The same goes for human brains.
Any system that can emulate any other Turing complete system is Turing complete, so they are Turing complete.
You seem to confuse Turing completeness with a specific way of executing something. Turing completeness is about the theoretical limits on which set of functions a system can execute, not how they execute them.
Not all algorithms can be distributed effectively across multiple threads.
A computer can have 1000 cpu cores, and only be able to use a single one when running such algorithms.
Some other algorithms may be distributed through branch predictions, by trying to run future processing steps ahead of time for each possible outcome of the current processing step. In fact, modern CPU's already do this a lot to speed up processing.
But even branch prediction hits something like a logarithmic wall of diminishing returns.
While you are right that multi core CPU's (or whole data centers) can run such algorithms, that doesn't mean they can run them quickly, hence my claim:
>> My point is that any such system is extremely limited due to how slow
Algorithms that can only utilize a single core seem to be stuck at the GFLOPS scale, regardless of what hardware they run on.
Even if only a small percentage (like 5%) of the code in a program is inherently limited to being single threaded, you will at best achieve TFlops numbers. This imposes a fatal limitation on computational problems that require very large amounts of computing power (for instance at the ExaFlop scale or higher).
THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.
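The limit being described here is usually called Amdahl's law. A small sketch of the arithmetic, using the 5% figure from above; the function name is illustrative, and real workloads add communication overhead that this formula ignores.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Best-case speedup when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

for workers in (10, 1_000, 1_000_000):
    print(workers, round(amdahl_speedup(0.05, workers), 2))
# The speedup approaches 1 / 0.05 = 20x no matter how many workers are added.
```

So with 5% serial code, a million cores buy less than a 20x improvement over one core.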
So if you want to do calculations that require, say, 1 ExaFlop (about the raw compute of the human brain) to be fast enough for a given purpose, you need to make almost all compute steps fully parallelizable.
Now that you've restricted your algorithm to no longer require all features of a Turing Complete system, you CAN still run it on Turing Complete CPU's. You're just not MAKING USE of their Turing Completeness. That's just very expensive.
At this point, you may as well build dedicated hardware that does not have all the optimizations that CPU's have for single threaded computation, like GPU's or TPU's, and lower your compute cost by 10x, 100x or more (which could be the difference between $500 billion and $billion).
At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.
One way to describe this, is that from the space of all possible algorithms that can run on a Turing Complete system, you've selected a small sub-space of algorithms that can be parallelized.
By doing this trade, you've severely limited what algorithms you can run, in exchange for the ability to get a speedup of 6 orders of magnitude, or more in many cases.
And in order to keep this speedup, you also have to accept other hardware based limitations, such as staying within the amount of GPU memory available, etc.
Sure, you COULD train GPT-5 or Grok-3 on a C-64 with infinite cassette tape storage, but it would take virtually forever for the training to finish. So that fact has no practical utility.
I DO realize that the concept of the equivalence of all Turing Complete systems is very beautiful. But this CAN be a distraction, and lead to intuitions that seem to me to be completely wrong.
Like Deutsch's idea that the ideas in a human brain are fundamentally linked to the brain's Turing Completeness. While in reality, it takes years of practice for a child to learn how to be Turing Complete, and even then the child's brain will struggle to do a floating point calculation every 5 minutes.
Meanwhile, joint systems of algorithms and the hardware they run on can do very impressive calculations when ditching some of the requirements of Turing Completeness.
Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.
> THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.
The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.
It's not about whether the algorithms require a Turing complete system, but that Turing completeness proves the equivalence of the upper limit of which set of functions the architecture can compute, and that pretty much any meaningful architecture you will come up with is still Turing complete.
> At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.
If a system can take 3 bits of input and use it to look up 5 bits of output in a table of 30 bits of data, and it is driven by a driver that uses 1 bit of the input as the current/new state, 1 bit for the direction to move the tape, and 3 bits for the symbol to read/write, and that driver processes the left/right/read/write tape operations and loops back, you have Turing complete system (Wolfram's 2-state 3-symbol Turing machine).
So no, you have not left Turing completeness behind, as any function that can map 3 bits of input to 5 bits of output becomes a Turing complete system if you can put a loop and IO mechanism around it.
Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.
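The "lookup table plus a loop" construction can be sketched as below. The rule table here is a toy (it flips bits and halts at the first blank), chosen for readability; it is not Wolfram's 2-state 3-symbol machine and makes no claim about universality. `RULES` and `run` are illustrative names.

```python
from collections import defaultdict

# (state, symbol read) -> (symbol to write, head movement, next state)
RULES = {
    ("flip", "0"): ("1", +1, "flip"),
    ("flip", "1"): ("0", +1, "flip"),
    ("flip", "_"): ("_", 0, "halt"),
}

def run(tape_input: str, max_steps: int = 10_000) -> str:
    # Tape that grows on demand; unseen cells read as the blank symbol "_".
    tape = defaultdict(lambda: "_", enumerate(tape_input))
    state, head = "flip", 0
    for _ in range(max_steps):      # the driver loop around the lookup
        if state == "halt":
            break
        write, move, state = RULES[(state, tape[head])]
        tape[head] = write
        head += move
    return "".join(tape[i] for i in range(len(tape_input)))

print(run("10110"))
```

All the "intelligence" lives in the table; the driver only reads, writes, moves and loops. Swapping in a different table changes what is computed without touching the loop.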
I know. That is part of my claim that this "talking about Turing completeness" is a distraction. Specifically because it ignores efficiency/speed.
> Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.
And again, I KNOW that it's utterly trivial to create a Turing Complete system. I ALSO know that a Turing Complete system can perform ANY computation (it pretty much defines what a computation is), given enough time and memory/storage.
But if such a calculation takes 10^6 times longer than necessary, it's also utterly useless to approach it in this way.
Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.
> The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.
This model is intrinsically single threaded, so the global branches/dependencies requirement is trivially satisfied.
Generally, though, if you want to be able to distribute a computation, you have to pretty much never allow the results of a computation of any arbitrary compute thread to affect the next computation on any of the other threads.
NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.
Also, downstream from this are the hardware optimizations that are needed to even run these algorithms. While you _could_ train any of the current LLM's on large CPU clusters, a direct port would require perhaps 1000x more hardware, electricity, etc than running it on GPU's or TPU's.
Not only that, but if the networks being trained (+ some amount of training data) couldn't fit into the fast GPU/TPU memory during training, but instead had to be swapped in and out of system memory or disk, then that would also cause orders of magnitude of slowdown, even if using GPU/TPU's for the training.
In other words, what we're seeing is a trend towards ever increasing coupling between algorithms being run and the hardware they run on.
When I say that thinking in terms of Turing Completeness is a distraction, it doesn't mean it's wrong.
It's just irrelevant.
Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point. Because Turing completeness does not mean all, or most, or even any of your computations need to be carried out like in a UTM. It only means it needs the theoretical ability. They can take any set of shortcuts you want.
I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.
> It only means it needs the theoretical ability.
This is absolutely incorporated in my previous post. Which is why I wrote:
>> Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.
> It only means it needs the theoretical ability. They can take any set of shortcuts you want.
I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.
Only by staying within the subset of all possible algorithms that CAN be properly parallelized (and having the proper hardware to run them) can you perform the number of calculations needed to train something like an LLM.
> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice,
Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.
On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at its dedicated task (training LLM's).
Another example would be the brain of a new-born baby. I'm pretty sure such a brain is NOT Turing Complete in any way. It has a very short list of training algorithms that are constantly running as it's developing.
But it can't even run Hello World.
For it to really be Turing Complete, it needs to be able to follow instructions accurately (no hallucinations, etc) and also needs access to infinite storage/tape (or it will be a Finite State Machine). Again, it still doesn't matter if it's Turing Complete in this context.
If you are holding up a 3B parameter model as an example of "LLM's can't reason" I'm not sure if the authors are confused or out of touch.
I mean, they do test 4o and o1-preview, but their performance is notably absent from the paper's conclusion.
It would’ve been nice to see one of the larger llama models though.
Those results are absent from the conclusion because the conclusion falls apart otherwise.
One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings.
That's a recipe for exploding bias, as we’ve seen with classic statistical crime detection systems.
Take a common word problem in a 5th grade math text book. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things.
Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different.
Word tokenization alone should fail miserably.
We get better at it over time, as probably most of us can attest.
(they often pattern-match on the farmer/grain/sheep/fox puzzle and start inventing pointless trips ("the farmer returns alone. Then, he crosses again.") in a way that a human wouldn't)
The point is that, as the person I replied to pointed out, saying that LLM's are "next token predictors" is a meaningless dismissal, as they can be both next token predictors and Turing complete. Unless reasoning requires functions outside the Turing computable (and we know of no way of constructing such functions, or even of a way for them to exist), calling them "next token predictors" says nothing about their capabilities.
The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text.
It’s very apparent when I want to get them to do a very specific thing. They get inconsistent about it.
Translation is a task I’ve had good results with, particularly mistral models. Which makes sense as it’s basically just “repeat this series of tokens with modifications”.
The closed models are practically useless from an empirical standpoint as you have no idea if the model you use Monday is the same as Tuesday. “Open” models at least negate this issue.
Likewise, I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What the LLM produces is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions, they are solutions up voted by novices.
I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is inexperienced people can’t even spot that and won’t get this best answer.
In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
Yes. That was my experience with most human-produced code I ran into professionally, too.
> In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
Yes, that sometimes works with humans as well. Although you usually need to provide more specific feedback to nudge them in the right track. It gets tiring after a while, doesn't it?
I keep seeing people say “yeah well I’ve seen humans that can’t do that either.”
What’s the point you’re trying to make?
> I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What the LLM produces is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions
Most professional developers are unable to produce code up to the standard of "the top answer in stack overflow" that the commenter was complaining about, with the additional twist that most developers' breadth of knowledge is going to be limited to a very narrow range of APIs/platforms/etc. whereas these LLMs are able to be comparable to decent programmers in just about any API/language/platform, all at once.
I've written code for thirty years and I wish I had the breadth and depth of knowledge of the free version of ChatGPT, even if I can outsmart it in narrow domains. It is already very decent and I haven't even tried more advanced models like o1-preview.
Is it perfect? No. But it is arguably better than most programmers in at least some aspects. Not every programmer out there is Fabrice Bellard.
The comparison is weird and dehumanizing.
I, personally, have never worked with someone who consistently puts out code that is as bad as LLM generated code either.
> Most professional developers are unable to produce code up to the standard of "the top answer in stack overflow"
How could you possibly know that?
All these types of arguments come from a belief that your fellow human is effectively useless.
It’s sad and weird.
> How could you possibly know that?
I worked at four multinationals and saw a bunch of their code. Most of it wasn't "the top answer in stack overflow". Was some of the code written by some of the people better than that? Sure. And a lot of it wasn't, in my opinion.
> All these types of arguments come from a belief that your fellow human is effectively useless.
Not at all. I think the top answers in stack overflow were written by humans, after all.
> It’s sad and weird.
You are entitled to your own opinion, no doubt about it.
These are called "code reviews" and we do that amongst human coders too, although they tend to be less Socratic in nature.
I think it has been clear from day one that LLMs don't display superhuman capabilities, and a human expert will always outdo one in tasks related to their particular field. But the breadth of their knowledge is unparalleled. They're the ultimate jacks-of-all-trades, and the astonishing thing is that they're even "average Joe" good at a vast number of tasks, never mind "fresh college graduate" good.
The real question has been: what happens when you scale them up? As of now it appears that they scale decidedly sublinearly, but it was not clear at all two or three years ago, and it was definitely worth a try.
One of the things I find extremely frustrating is that almost no research on LLM reasoning ability benchmarks them against average humans.
Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.
If you're testing to see whether it can replace certain types of work, then it depends on where you would normally set the bar for that type of work. You could offload a whole lot of work with something that can reliably reason at below an average human.
What’s the point of your argument?
AI companies: “There’s a new machine that can do reasoning!!!”
Some people: “actually they’re not very good at reasoning”
Some people like you: “well neither are humans so…”
> research on LLM reasoning ability benchmarks them against average humans
Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.
> Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.
So what? How does that assumption make LLMs better?
> Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.
They're currently regularly being benchmarked against expectations most humans can't meet. It'd make the models look a whole lot better.
The "next token prediction" is just the API, it doesn't tell you anything about the complexity of the thing that actually does the prediction. (In think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains).
There is still a limit how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.
I think this structure makes it quite hard to really say how much reasoning is possible.
I’ve yet to see anything close to the level of evidence needed to support the claim.
To say LLMs as a class is architecturally able to be trained to reason is - in the complete absence of evidence to suggest humans can compute functions outside the Turing computable - is effectively only an argument that they can implement a minimal Turing machine given the context is used as IO. Given the size of the rules needed to implement the smallest known Turing machines, it'd take a really tiny model for them to be unable to.
Now, you can then argue that it doesn't "count" if it needs to be fed a huge program step by step via IO, but if it can do something that way, I'd need some really convincing evidence for why the static elements those steps could not progressively be embedded into a model.
Now consider that you can trivially show that you can get an LLM to "execute" on step of a Turing machine where the context is used as an IO channel, and will have shown it to be Turing complete.
> I think this structure makes it quite hard to really say how much reasoning is possible.
Given the above, I think any argument that they can't be made to reason is effectively an argument that humans can compute functions outside the Turing computable set, which we haven't the slightest shred of evidence to suggest.
What evidence do you have for either of these, since I don't recall any proof that "functions computable by Turing machines" is equal to the set of functions that can exist. And I don't recall pretrained llms being proven to be Turing machines.
As it stands, Church, Turing, and Kleene have proven that the set of generally recursive functions, the lambda calculus, and the Turing computable set are equivalent, and no attempt to categorize computable functions outside those sets has succeeded since.
If you want your name in the history books, all you need to do is find a single function that humans can compute that a is outside the Turing computable set.
As for LLMs, you can trivially test that they can act like a Turing machine if you give them a loop and use the context to provide access to IO: Turn the temperature down, and formulate a prompt to ask one to follow the rules of the simplest known Turing machine. A reminder that the simplest known Turing machine is a 2-state, 3-symbol Turing machine. It's quite hard to find a system that can carry out any kind of complex function that can't act like a Turing machine if you allow it to loop and give it access to IO.
I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works.
I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language:
> Brain studies show that language is not essential for the cognitive processes that underlie thought.
> For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...
> You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ...
> There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities.
https://www.scientificamerican.com/article/you-dont-need-wor...
The right hemisphere almost certainly uses internal 'language' either consciously or unconsciously to define objects, actions, intent.. the fact that they passed these tests is evidence of that. The brain damage is simply stopping them expressing that 'language'. But the existence of language was expressed in the completion of the task..
While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step.
Personally, I think such beliefs about concepts like consciousness, free will, qualia and emotions emerge from how the human brain includes a simplified version of itself when setting up a world model. In fact, I think many such elements are pretty much hard coded (by our genes) into the machinery that human brains use to generate such world models.
Indeed, if this is true, concepts like consciousness, free will, various qualia and emotions can in fact be considered "symbols" within this world model. While the full reality of what happens in the brain when we exercise what we represent by "free will" may be very complex, the world model may assign a boolean to each action we (and others) perform, where the action is either grouped into "voluntary action" or "involuntary action".
This may not always be accurate, but it saves a lot of memory and compute costs for the brain when it tries to optimize for the future. This optimization can (and usually is) called "reasoning", even if the symbols have only an approximated correspondence with physical reality.
For instance, if in our world model somebody does something against us and we deem that it was done exercising "free will", we will be much more likely to punish them than if we categorize the action as "forced".
And on top of these basic concepts within our world model, we tend to add a lot more, also in symbol form, to enable us to use symbolic reasoning to support our interactions with the world.
Huh.
I don't know bout incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to.
If anything, "next token prediction" seems much closer to how human thinking works than anything even remotely formal or symbolic that was proposed before.
As for hardcoding things in world models, one thing that LLMs do conclusively prove is that you can create a coherent system capable of encoding and working with meaning of concepts without providing anything that looks like explicit "meaning". Meaning is not inherent to a term, or a concept expressed by that term - it exists in the relationships between an the concept, and all other concepts.
Indeed, this is one reason why I assert that Wittgenstein was wrong about the nature of human thought when writing:
"""If there were a verb meaning "to believe falsely," it would not have any significant first person, present indicative."""
Sure, it's logically incoherent for us to have such a word, but there's what seems like several different ways for us to hold contradictory and incoherent beliefs within our minds.
Yes. But some place too much confidence in how "rational" their intuition is, including some of the most intelligent minds the world has seen.
Specifically, many operate as if their intuition (that they treat as completely rational) has some kind of supernatural/magic/divine origin, including many who (imo) SHOULD know better.
While I think (like you do) that this intuition has a lot in common with LLM's and other NN architectures than pure logic, or even the scientific method.
Did Gödel really say this? It sounds like quite a stretch of incompleteness theorem.
It's like saying because halting problem is undecidable, but humans can debug programs, therefore human brains must having some supernatural power.
It appears that he was religious and probably believed in an immaterial and maybe even divine soul [2]. If so, that may explain why he believed that human intuition could be unburdend by the incompleteness theorem.
[1] https://philsci-archive.pitt.edu/9154/1/Nesher_Godel_on_Trut...
[2] https://en.wikipedia.org/wiki/G%C3%B6del%27s_ontological_pro...
It's absolutely amazing to see a super-GM (in that case it was Hikaru) see a position, and basically "play-by-play" it from the beginning, to show people how they got in that position. It wasn't his game btw. But later in that same video when asked he explained what I wrote in the first paragraph. It works with proper games, but it rarely works with weird random chess puzzles, as he put it. Or, in other words, chess puzzles that come from real games are much better than "randomly generated", and make more sense even to the best of humans.
I can sort of confirm that. I never learned all the formal theoretical standard chess strategies except for the basic ones. So when playing against really good players, way above my level, I could win sometimes (or allmost) simply by making unconventional (dumb by normal strategy) moves in the beginning - resulting in a non standard game where I could apply pressure in a way the opponent was not prepared for (also they underestimated me after the initial dumb moves). For me, the unconventional game was just like a standard game, I had no routine - but for the experienced one, it was way more challenging. But then of course in the standard situations, to which allmost every chess game evolves to - they destroyed me, simply for experience and routine.
IIRC Magnus Carlsen is said to do something like this as well - he'll play opening lines that are known to be theoretically suboptimal to take his opponent out of prep, after which he can rely on his own prep/skills to give him better winning chances.
You can score points against e.g. national team members who've been 5-0'ing the rest of the pool by doing weird cheap tricks. You won't win though, because after one or two points they will adjust and then wreck you.
And on the flip side, if you're decently rated (B ~ A ish) and are used to just standard fencing, if you run into someone who's U ~ E and does something weird like literally not move their feet, it can take you a couple touches to readjust to someone who doesn't behave normally.
Unlike chess though, in fencing the unconventional stuff only works for a couple points. You can't stretch that into a victory, because after each point everything resets.
Maybe that's why pentathlon (single touch victory) fencing is so weird.
Another thing he noticed is that, when asked to set up a game they were shown earlier, the errors expert players made often were insignificant. For example, they would set up the pawn structure on the king side incorrectly if the game’s action was on the other side of the board, move a bishop by a square in such a way didn’t make a difference for the game, or even add an piece that wasn’t active on the board.
Beginners would make different errors, some of them hugely affecting the position on the board.
Good ones are never randomly generated, however. Also, the skill doesn't fully transfer in either direction between live play and solving chess problems. Definitely not reconstructing the prior state of the board, since there's nothing there to reconstruct.
So yes, everything Hikaru was saying there makes sense to me, but I don't think your last sentence follows from it. Good chess problems come from good chess problem authors (interestingly this included Vladimir Nabokov), they aren't random, but they rarely come from games, and tickle a different part of the brain from live play.
It is false that GMs would have any trouble determining legal moves in randomly generated positions. Indeed, even a 1200 level player on chess.com will find that pretty trivial.
A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.
3.5-turbo-instruct's illegal move rate is about 5 or less in 8205
I think it's also significantly harder to play chess if you were to hear a sequence of moves over the phone and had to reply with a followup move, with no space or time to think or talk through moves.
I would tend to agree that there's a big difference between attempting to make a move that's illegal because of the state of a different region of the board, and attempting to make one that's illegal because of the identity of the piece being moved, but if your only category of interest is "illegal moves", you can't see that difference.
Software that knows the rules of the game shouldn't be making either mistake.
I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.
The article doesn't say how often the LLM fails to generate legal moves in ten tries, but it can't be often or the level of play would be much much much worse.
As seems often the case, the LLM seems to have a brilliant intuition, but no precise rigid "world model".
Of course words like intuition are anthropomorphic. At best a model for what LLMs are doing. But saying "they don't understand" when they can do _this well_ is absurd.
Since we already have programs that can do this, that definitely aren’t really thinking and don’t “understand” anything at all, I don’t see the relevance of this part.
However, since an LLM is a generalist engine, if it understands chess there is no reason for it not to understand millions of other concepts and how they relate to each other. And this is the kind of understanding that humans do.
When we talk about understanding a simple axiomatic system, understanding means exactly that the entirety of the axioms are modeled and applied correctly 100% of the time. This is chess, not something squishy like literary criticism. There’s no need to debate semantics at all. One illegal move is a deal breaker
Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
We can argue whether an error rate of 1 in a million means that it plays like a grandmaster or a novice, but that’s less interesting. It failed to model a simple system correctly, and a much shorter/simpler program could do that. Doesn’t seem smart if our response to this as an industry is to debate semantics, ignore the issue, and work feverishly to put it to work modeling more complicated / critical systems.
Here are things people say:
Magnus Carlsen has a better understanding of chess than I do. (Yet we both know the precise rules of the game.)
Grandmasters have a very deep understanding of Chess, despite occasionally making illegal moves that are not according to the rules (https://www.youtube.com/watch?v=m5WVJu154F0).
"If AlphaZero were a conventional engine its developers would be looking at the openings which it lost to Stockfish, because those indicate that there's something Stockfish understands better than AlphaZero." (https://chess.stackexchange.com/questions/23206/the-games-al...)
> Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.
How exactly is this relevant to the question whether LLMs can be said to have some understanding of chess? Can they consistently apply the rules when game states are given in pgn? No. _Very_ few humans without specialized training could either (without using a board as a tool to keep track of the implicit state). They certainly "know" the rules (even if they can't apply them) in the sense that they will state them correctly if you ask them to.
I am not particularly interested in "the industry". It's obvious that if you want a system to play chess, you use a chess engine, not an LLM. But I am interested in what their chess abilities teaches us about how LLMs build world models. E.g.:
> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:
Fine. As others have pointed out and I hinted at.. debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing if they were applied to people vs for AI. Also.. I'm also not personally that interested in how performance stacks up compared to humans. Even if it were interesting, the topic of human-equivalent performance would not have static expectations either. For example human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than they are in axiomatic game-play.
> I am interested in what their chess abilities teaches us about how LLMs build world models
Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, with some amount of errors. With no other information at all, this tells us that it failed to model the rules, or it failed in the application of those rules, or both.
Based on that, some questions worth asking: Which one of these failure modes is really acceptable and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by explaining the rules directly in the prompt and also explicitly stating to not make illegal moves? If it's failing to apply rules, but excels as a model-maker.. then perhaps it can spit out a model directly from examples, and then I can feed the model into a separate engine that makes correct, deterministic steps that actually honor the model?
Saying that LLMs do or don't understand chess is lazy I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with premature congratulations and optimism that seems to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract but very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".
The system understands nothing, it's anthropomorphising it to say it does.
It seems like many people tend to use the word "understand" to that not only does someone believe that a given move is good, they also belive that this knowledge comes from a rational evaluation.
Some attribute this to a non-material soul/mind, some to quantum mechanics or something else that seems magic, while others never realized the problem with such a belief in the first place.
I would claim that when someone can instantly recognize good moves in a given situation, it doesn't come from rationality at all, but from some mix of memory and an intuition that has been build by playing the game many times, with only tiny elements of actual rational thought sprinkled in.
This even holds true when these people start to calculate. It is primarily their intuition that prevens them from spending time on all sorts of unlikely moves.
And this intuition, I think, represents most of their real "understanding" of the game. This is quite different from understanding something like a mathematical proof, which is almost exclusively inducive logic.
And since "understand" so often is associated with rational inductive logic, I think the proper term would be to have "good intuition" when playing the game.
And this "good intuition" seems to me precisely the kind of thing that is trained within most neural nets, even LLM's. (Q*, AlphaZero, etc also add the ability to "calculate", meaning traverse the search space efficiently).
If we wanted to measure how good this intuition is compared to human chess intuition, we could limit an engine like AlphaZero to only evaluate the same number of moves per second that good humans would be able to, which might be around 10 or so.
Maybe with this limitation, the engine wouldn't currently be able to beat the best humans, but even if it reaches a rating of 2000-2500 this way, I would say it has a pretty good intuitive understanding.
Suppose it tries to capture en passant. How do you know whether that's legal?
It isn't even wrong.
Said differently in case I phrased that poorly - couldn't the LLM still learn the it only ever saw bishops move diagonally and therefore only considering those moves without actually reasoning through the concept of legal and illegal moves?
1. Explain the current board position and the plan going forwards, before proposing a move. This lets the model actually think more, kind of like o1, but here it would guarantee a more focused processing.
2. Actually draw the ascii board for each step. Hopefully producing more valid moves since board + move is easier to reliably process than 20×move.
These ideas were proven to work very well in the ReAct paper (and by extension, the CoT Chain of Thought paper). Could also extend this by asking it to do this N times and stop when we get the same answer a majority of times (this is an idea stolen from the CoT-SC paper, chain of through self-consistency).
I doubt that this is going to make much difference. 2D "graphics" like ASCII art are foreign to language models - the models perceive text as a stream of tokens (including newlines), so "vertical" relationships between lines of text aren't obvious to them like they would be to a human viewer. Having that board diagram in the context window isn't likely to help the model reason about the game.
Having the model list out the positions of each piece on the board in plain text (e.g. "Black knight at c5") might be a more suitable way to reinforce the model's positional awareness.
However, as you point out, the way we feed these models especially make them vertically challenged, so to speak. This makes them unable to reliably identify vertically separated components in a circuit for example.
With combined vision+text models becoming more common place, perhaps running the rendered text input through the vision model might help.
The relative rarity of this representation in training data means it would probably degrade responses rather than improve them. I'd like to see the results of this, because I would be very surprised if it improved the responses.
RE 1., definitely worth trying, and there's more variants of such tricks specific to models. I'm out of date on OpenAI docs, but with Anthropic models, the docs suggest using XML notation to label and categorize most important parts of the input. This kind of soft structure seems to improve the results coming from Claude models; I imagine they specifically trained the model to recognize it.
See: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
In author's case, for Anthropic models, the final prompt could look like this:
<role>You are a chess grandmaster.</role>
<instructions>
You will be given a partially completed game, contained in <game-log> tags.
After seeing it, you should repeat the ENTIRE GAME and then give ONE new move
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
ALWAYS repeat the entire representation of the game so far, putting it in <new-game-log> tags.
Before giving the new game log, explain your reasoning inside <thinking> tag block.
</instructions>
<example>
<request>
<game-log>
*** example game ***
</game-log>
</request>
<reply>
<thinking> *** some example explanation ***</thinking>
<new-game-log> *** game log + next move *** </new-game-log>
</reply>
</example>
<game-log>
*** the incomplete game goes here ***
</game-log>
This kind of prompting is supposed to provide noticeable improvement for Anthropic models. Ironically, I only discovered it few weeks ago, despite having been using Claude 3.5 Sonnet extensively for months. Which goes to say, RTFM is still a useful skill. Maybe OpenAI models have similar affordances too, simple but somehow unnoticed? (I'll re-check the docs myself later.)My guess would be that the persona of the openAI team on platforms like Twitter is very cliquey. This, I think, naturally leads to mistrust. A clique feels more likely to cheat than some other sort of group.
* disclaimer - only n=7 on o1. Others are like 100-300 each
Something weird is happening with LLMs and Chess
If you have a problem and all of your potential solutions are unlikely, then it's fine to assume the least unlikely solution while acknowledging that it's statistically probable that you're also wrong. IOW if you have ten potential solutions to a problem and you estimate that the most likely solution has an 11% chance of being true, it's fine to assume that solution despite the fact that, by your own estimate, you have an 89% chance of being wrong.
The "OpenAI is secretly calling out to a chess engine" hypothesis always seemed unlikely to me (you'd think it would play much better, if so), but it seemed the easiest solution (Occam's razor) and I wouldn't have been surprised to learn it was true (it's not like OpenAI has a reputation of being trustworthy).
Well, the failed revolution from last year combined with the non-profit bait-and-switch pretty much conclusively proved that OpenAI researchers are in it for the money first and foremost, and pride has a dollar value.
And how does that prove anything about their motivations "first and foremost"? They could be in it because they like the work itself, and secondary concerns like open or not don't matter to them. There's basically infinite interpretations of their motivations.
Why not? Stop calling it "the entire company colluding and lying" and start calling it a "messaging strategy among the people not prevented from speaking by NDA." That will pass a casual Occam's test that "lying" failed. But they both mean the same exact thing.
Occam's test applies to the full proposal, including the explanation of things outlined above.
In my opinion, it only seems like the easiest solution on the surface taking basically nothing into account. By the time you start looking at everything in context, it just seems bizarre.
The issue is that there might be more to reason than appearing to reason. We just don't know. I'm not sure how it's apparently so unknown or unappreciated by people in the computer world, but there are major unresolved questions in science and philosophy around things like thinking, reasoning, language, consciousness, and the mind. No amount of techno-optimism can change this fact.
The issue is we have not gotten further than more or less educated guesses as to what those words mean. LLMs bring that interesting fact to light, even providing humanity with a wonderful nudge to keep grappling with these unsolved questions, and perhaps make some progress.
To be clear, they certainly are sometimes passably good when it comes to summarising selectively and responsively the terabytes and terabytes of data they've been trained on, don't get me wrong, and I am enjoying that new thing in the world. And if you want to define reason like that, feel free.
Look, you can put as many underscores as you like, the question of whether these machines are really reasoning or emulating reason is not a solved problem. We don't know what reasoning is! We don't know if we are really reasoning, because we have major unresolved questions regarding the mind and consciousness[1].
These may not be intractable problems either, there's reason for hope. In particular, studying brains with more precision is obviously exciting there. More computational experiments, including the recent explosion in LLM research, is also great.
Still, reflexively believing in the computational theory of the mind[2] without engaging in the actual difficulty of those questions, though commonplace, is not reasonable.
[0] Jozarov on YT has great commentary of top engine games, worth checking out.
Chess engines can reason about chess (they can even explain their reasoning). LLMs can reason about many other things, with varied efficiency.
What everyone is currently trying to build is something like AlphaZero (adversarial self-improvement for superhuman performance) with the state space of LLMs (general enough to be useful for most tasks). When we’ll have this, we’ll have AGI.
Anything else is just an argument of semantics. The idea that there is "true" reasoning and "fake" reasoning but that we can't tell the latter apart from the former is ridiculous.
You can't eat your cake and have it. Either "fake reasoning" is a thing and can be distinguished or it can't and it's just a made up distinction.
The look up table is the same. It will fall apart with numbers above 100. That's the distinction.
People need to start bringing up the supposed distinction that exists with LLMs instead of nonsense examples that don't even pass the test outlined.
Regurgitating and Examples are both ways to lean into that and try to recover whatever has been lost by Chat-based tuning.
Using regurgitation to get around the assistant/user token separation is another fun tool for the toolbox, relevant for whenever you want a model that doesn't support continuation actually perform continuation (at the cost of a lot of latency).
I wonder if any type of reflection or chains of thought would help it play better. I wouldn't be surprised if getting the LLM to write an analysis of the game in English is more likely to move it out of distribution than to make it pick better chess moves.
This is still my impression of LLMs in general. It's amazing that they work, but for the next tech disruption, I'd appreciate something that doesn't make you feel like being in a bad sci-fi movie all the time.
This is someone with limited knowledge of chess, statistics and LLMs doing a series of public articles as they learn a little tiny bit about chess, statistics and LLMs. And it garners upvotes and attention off the coat-tails of AI excitement. Which is fair enough, it's the (semi-)public internet, but it sort of masquerades as being half-serious "research", and it kind of held things together for the first article, but this one really is thrown together to keep the buzz going of the last one.
The TL;DR :: one of the AIs being just-above-terrible, compared to all the others being completely terrible, a fact already of dubious interest, is down to - we don't know. Maybe a difference in training sets. Tons of speculation. A few graphs.
I doubt it's doing much more than a static analysis of the board position, or even moving based mostly on just a few recent moves by key pieces.
Maybe less morally challenging, as well. You wouldn't be trying to install "sentience".
Considering that training models on code seems to improve their abilities on non-coding tasks in actual testing, this isn't even all that far-fetched. Perhaps that is why GPT-3.5 was specifically trained on chess in the first place.
To get to the bottom of this it would be interesting to train LLMs on nothing but chess games (can synthesize them endlessly by having Stockfish play against itself) with maybe a side helping of chess commentary and examples of chess dialogs “how many pawns are on the board?”, “where are my rooks?”, “draw the board”, competence at which would demonstrate that it has a representation of the board.
I don’t believe in “emergent phenomena” or that the general linguistic competence or ability to feign competence is necessary for chess playing (being smart at chess doesn’t mean you are smart at other things and vice versa). With experiments like this you might prove me wrong though.
This paper came out about a week ago
https://arxiv.org/pdf/2411.06655
seems to get good results with a fine-tuned Llama. I also like this one as it is about competence in chess commentary
I suspect chess skill is completely useless for LLMs in general and not an emergent phenomenon, just consuming gradient bandwidth and parameter space to do this neat trick. This is clear to me because the LLMs that aren't trained specifically on chess do not do chess well.
A difference between communication and chess is that your partner in conversation is your ally in meaning making and will help fix your mistakes which is how they get away with bullshitting. ("Personality" makes a big difference, by the time you are telling your programming assistant "Dude, there's a red squiggle on line 92" you are under its spell)
Chess on the other hand is adversarial and your mistakes are just mistakes that your opponent will take advantage of. If you make a move and your hunch that your pieces are not in danger is just slightly wrong (one piece in danger) that's almost as bad as having all your non-King pieces in danger (they can only take one next turn.)
Why are you manually guessing ways to improve this? Why not let the LLMs do this for themselves and find iteratively better prompts?
This is extremely interesting. In this specific case at least, simply giving examples is equivalent to fine-tuning. This is a great discovery for me, I'll try using examples more often.
I can't explain why. I always had the intuition that fine-tuning was overrated.
One reason perhaps is that examples are "right there" and thus implicitly weighted much more in relation to the fine-tuned neurons.
While it is not very important for this toy case, it's good to keep in mind that each provided example in the input will increase the prediction time and cost compared to fine-tuning.
Is this implicit in the "you are a grandmaster chess player" prompt?
Is there some part of the LLM training that does "if this is a game, then I will always try to win"?
Could the author improve the LLM's odds of winning just by telling it to try and win?
In almost all examples and explanations it has seen from chess games, each player would be trying to win, so it is simply the most logical thing for it to make a winning move. So I wouldn't expect explicitly prompting it to win to improve its performance by much if at all.
The reverse would be interesting though, if you would prompt it to make losing/bad moves, would it be effective in doing so, and would the moves still be mostly legal? That might reveal a bit more about how much relies on concepts it's seen before.
That way you're trying to emulate cases where someone is trying, but isn't very good yet, versus trying to emulate cases where someone is clearly and intentionally losing which is going to be orders of magnitude less common in the training data. (And I also would bet "losing" is also a vector/token too closely tied to ANY losing game, but those players were still putting up a fight to try and win the game. Could still drift towards some good moves!)
Even if the pool was poisoned by games in which some players are trying to lose (probably insignificant), no one annotates player intent in chess games, and so prompting it to win or lose doesn't let the LLM pick up on this.
You can try this by asking an LLM to play to lose. ChatGPT ime tries to set itself up for scholar's mate, but if you don't go for it, it will implicitly start playing to win (e.g. taking your unprotected pieces). If you ask it "why?", it gives you the usual bs post-hoc rationalization.
There are drawn and losing games in the training set though.
It would be fairly hilarious if the reinforcement training has made the LLM unwilling to make the human feel bad through losing a game.
I wonder if there are variants that have good baselines. It might be tough to evaluate vis-a-vis human performance on novel games.
I believe the reason why such models were later deprecated was "alignment".
1. They generate chess games from chess engine self play and add that to the training data (similar to the already-stated theory about their training data).
2. They have added chess reinforcement learning to the training at some stage, and actually got it to work (but not very well).
Where's the source for this? What's the reasoning? I don't see it. I have just relooked, and still can't see it.
Is it 1800 lichess "Elo", or 1800 FIDE, that's being claimed? And 1800 at what time control? Different time controls have different ratings, as one would imagine/hope the author knows.
I'm guessing it's not 1800 FIDE, as the quality of the games seems far too bad for that. So any clarity here would be appreciated.
What am I missing? Where does it show there how the claim of "1800 ELO" is arrived at?
I can see various things that might be relevant, for example, the graph where it (GPT-3.5-turbo-instruct) is shown as going from mostly winning to mostly losing when it gets to Stockfish level 3. It's hard (/impossible) to estimate the lichess or FIDE ELO of the different Stockfish levels, but Lichess' Stockfish on level 3 is miles below 1800 FIDE, and it seems to me very likely to be below lichess 1800.
I invite any FIDE 1800s and (especially) any Lichess 1800s to play Stockfish level 3 and report back. Years ago when I played a lot on Lichess I was low 2000s in rapid, and I win comfortably up till Stockfish level 6, where I can win, but also do lose sometimes. Basically I really have to start paying attention at level 6.
Level 3 seems like it must be below lichess 1800, but it's just my anecdotal feeling of the strengths. Seeing as how the article is chock-a-block with unfounded speculation and bias though, maybe we can indulge ourselves too.
So: someone please explain the 1800 thing to me? And any lichess 1800s like to play guinea pig, and play a series of games against stockfish 3, and report back to us?
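For what it's worth, here is roughly how a number like "1800" could be backed out from results against an opponent of known strength, using the standard Elo expected-score formula. This is only a sketch: the 1500 opponent rating below is a made-up placeholder (as said above, nobody really knows the lichess or FIDE rating of each Stockfish level), and the function names are mine, not the article's.

```python
import math

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B (0..1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def performance_rating(opponent_rating: float, score_fraction: float) -> float:
    """Rating that would be expected to score score_fraction vs opponent_rating.

    This just inverts expected_score; it is undefined for 0% or 100% scores.
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    odds = score_fraction / (1.0 - score_fraction)
    return opponent_rating + 400.0 * math.log10(odds)

if __name__ == "__main__":
    # Hypothetical: a 75% score (win = 1, draw = 0.5) against a 1500-rated opponent.
    print(round(performance_rating(1500, 0.75)))
```

The catch is that the result is only as good as the opponent rating you plug in, and it is tied to whatever pool and time control that rating came from.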
https://arxiv.org/abs/2402.04494
Admittedly, this isn't really "the source" though. The first people to break the news on turbo-instruct's chess ability all pegged it around 1800. https://x.com/GrantSlatton/status/1703913578036904431
At least the arxiv paper is serious:
> A direct comparison between all engines comes with a lot of caveats since some engines use the game history, some have very different training protocols (i.e., RL via self-play instead of supervised learning), and some use search at test time. We show these comparisons to situate the performance of our models within the wider landscape, but emphasize that some conclusions can only be drawn within our family of models and the corresponding ablations that keep all other factors fixed.
It's claimed that this model "understands" chess, and can "reason", and do "actual logic" (here in the comments).
I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it doesn't really happen.
Is there a link to the games where the illegal moves are made?
You have to give this human the same log of the game to refer to.
What might be interesting is to see if there was some sort of prompt the LLM could use to help itself; e.g., "After repeating the entire game up until this point, describe relevant strategic and tactical aspects of the current board state, and then choose a move."
Another thing that's interesting is the 1800 ELO cut-off of the training data. If the cut-off were 2000, or 2200, would that improve the results?
Or, if you included training data but labeled with the player's ELO, could you request play at a specific ELO? Being able to play against a 1400 ELO computer that made the kind of mistakes a 1400 ELO human would make would be amazing.
It looks like they have 3 public bots on lichess.org: 1100, 1500, and 1900
As long as the LLM is a black box, it's entirely possible that (a) the LLM does reason through the rules and understands what moves are legal or (b) was trained on a large set of legal moves and therefore only learned to make legal moves. You can claim either case is the real truth, but we have absolutely no way to know because we have absolutely no way to actually understand what the LLM was "thinking".
https://thegradient.pub/othello/
Associated paper: https://arxiv.org/abs/2210.13382
What is difficult is finding some intermediate pattern in between there which we can label with an abstraction that is compatible with human understanding. It may not exist. For example, it may be more like how our brain works to produce language than it is like a logical rule based system. We occasionally say the wrong word, skip a word, spell things wrong...violate the rules of grammar.
The inputs and outputs of the model are human language, so at least there we know the system as a black box can be characterized, if not understood.
This is actually where the AI safety debates tend to lose. From where I sit we can't characterize the black box itself, we can only characterize the outputs themselves.
More specifically, we can decide what we think the quality of the output for the given input and we can attempt to infer what might have happened in between. We really have no idea what happened in between, and though many of the "doomers" raise concerns that seem far fetched, we have absolutely no way of understanding whether they are completely off base or raising concerns of a system that just hasn't shown problems in the input/output pairs yet.
How can you learn to make legal moves without understanding what moves are legal?
If I only see legal moves, I may not think outside the box and come up with moves other than what I already saw. Humans run into this all the time: we see things done a certain way and effectively learn that that's just how to do it, and we don't innovate.
Said differently, if the generative AI isn't actually being generative at all, meaning it's just predicting based on the training set, it could be providing only legal moves without ever learning or understanding the rules of the game.
I fear this is not the case: 1) Either, the LLM (or other forms of deep neural networks) can reproduce exactly what it saw, but nothing new (then it would only produce legal moves, if it was trained on only legal ones) 2) Or, the LLM can produce moves that it did not exactly see, by outputting the "most probable" looking move in that situation (which it never has seen before). In effect, this is combining different situations and their output into a new output. As a result of this "mixing", it might output an illegal move (= the output move is illegal in this new situation), despite having been trained on only legal moves.
In fact, I am not even sure if the deep neural networks we use in practice even can replicate their training data exactly - it seems to me that there is some kind of compression going on by embedding knowledge into the network, which will come with a loss.
I am deeply convinced that LLMs will never be exact technology (but LLMs + other technology like proof assistants or compilers might be)
The question I have is whether the LLM might be reproducing mostly legal moves only because it was trained on a set of data that itself only included legal moves. The training data would have only helped predict legal moves, and any illegal moves it predicts may very well be because LLMs are designed with random variables as part of the prediction loop.
To me that doesn't seem unreasonable and has nothing to do with irrationally going in circles, curious if you disagree though.
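To make the "random variables as part of the prediction loop" point concrete, here is a toy sketch of temperature sampling over next-move scores. The moves and scores are invented for illustration (including one illegal move with a low score), and real LLM decoding works over tokens rather than whole moves. The only point is that at temperature > 0 a low-probability option still gets picked occasionally, while greedy decoding never picks it.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_move(moves, logits, temperature, rng):
    """Sample one move; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        return moves[logits.index(max(logits))]
    probs = softmax(logits, temperature)
    return rng.choices(moves, weights=probs, k=1)[0]

# Made-up candidates: three legal moves and one illegal one ("Ke9").
moves = ["Nf3", "e4", "d4", "Ke9"]
logits = [2.0, 1.8, 1.5, -3.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_move(moves, logits, 1.0, rng) for _ in range(20_000)]
illegal_rate = samples.count("Ke9") / len(samples)
print(f"illegal move sampled {illegal_rate:.3%} of the time at T=1")
print("greedy pick:", sample_move(moves, logits, 0, rng))
```

None of this says whether the model "knows" the move is illegal; it only shows that sampling alone is enough to produce the occasional low-probability output.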
How well they learn completely novel tasks (fail in conversation, pass with training). How well they do complex tasks (debated; just look at this thread). How generally knowledgeable they are (pass). How often they do nonsensical things (fail).
So IMO it really comes down to whether you’re judging by peak performance or minimum standards. If I had an employee that performed as well as an LLM I’d call them an idiot because they needed constant supervision for even trivial tasks, but that’s not the standard everyone is using.
That's totally fair. I expect that to continue to work well when kept in the context of something/someone else that is roughly as intelligent as you are. Bonus points for the fact that one human understands what it means to be human and we all have roughly similar experiences of reality.
I'm not so sure if that kind of judging intelligence by feel works when you are judging something that is (a) totally different from you or (b) massively more (or less) intelligent than you are.
For example, I could see something much smarter than me as acting irrationally when in reality they may be working with a much larger or complex set of facts and context that don't make sense to me.
To me, this means that it absolutely doesn't matter whether LLM does reason or not.
Trying to castle through check is one that occasionally happens to me (I am rated 1800 on lichess).
That doesn't change that it's an illegal move.
Accidentally moving into check is probably the most common illegal move. Castling through check is surprisingly common, too. Actually moving a piece incorrectly is fairly rare, though. I remember one tournament where one of the matches ended in a DQ because one of the players had two white bishops.
In this case, of course, someone moved their bishop from black to white and their opponent didn't catch it until a while later.
This is somewhat imprecise (or inaccurate).
A quick search on YouTube for "GM illegal moves" indicates that GMs have made illegal moves often enough for there to be compilations.
e.g. https://www.youtube.com/watch?v=m5WVJu154F0 -- The Vidit vs Hikaru one is perhaps the most striking, where Vidit uses his king to attack Hikaru's king.
I have to admit it's been a while since I played chatgpt so maybe it improved.
Whereas the LLM makes "moves" that clearly indicate no ability to play chess: moving pieces to squares well outside their legal moveset, moving pieces that aren't on the board, etc.
What if he makes mistakes that a seeing person would never make?
Does that mean that the blind man is not capable of sculpting at all?
Do you have any evidence of that? TFA doesn't talk about the nature of these errors.
If I were to take that sentence literally, I would ask for at least 199 other examples, but I imagine that it was just a figure of speech. Nevertheless, if that's only one player complaining (even several times), can we really conclude that ChatGPT cannot play? Is that enough evidence, or is there something else at work?
I suppose indeed one could, if one expected an LLM to be ready to play out of the box, and that would be a fair criticism.
But the most interesting and thought-provoking one in there is [1] Carlsen v Inarkiev (2017). Carlsen puts Inarkiev in check. Inarkiev, instead of making a legal move to escape check, does something else. Carlsen then replies to that move. Inarkiev challenges: Carlsen's move was illegal, because the only legal "move" at that point in the game was to flag down an arbiter and claim victory, which Carlsen didn't!
[1] - https://www.youtube.com/watch?v=m5WVJu154F0&t=7m52s
The tournament rules at the time, apparently, fully covered the situation where the game state is legal but the move is illegal. They didn't cover the situation where the game state was actually illegal to begin with. I'm not a chess person, but it sounds like the tournament rules may have been amended after this incident to clarify what should happen in this kind of situation. (And Carlsen was still declared the winner of this game, after all.)
LLM-wise, you could spin this to say that the "rational grandmaster" is as fictional as the "rational consumer": Carlsen, from an actually invalid game state, played "a move that may or may not be illegal just because it sounds kinda “chessy”," as zoky commented below that an LLM would have done. He responded to the gestalt (king in check, move the king) rather than to the details (actually this board position is impossible, I should enter a special case).
OTOH, the real explanation could be that Carlsen was just looking ahead: surely he knew that after his last move, Inarkiev's only legal moves were harmless to him (or fatalistically bad for him? Rxb7 seems like Inarkiev's correct reply, doesn't it? Again I'm not a chess person) and so he could focus elsewhere on the board. He merely happened not to double-check that Inarkiev had actually played one of the legal continuations he'd already enumerated in his head. But in a game played by the rules, he shouldn't have to double-check that — it is already guaranteed by the rules!
Anyway, that's why Carlsen v Inarkiev struck me as the most thought-provoking illegal move, from a computer programmer's perspective.
I would say the analogy is more like someone saying chess moves aloud. So, just as we all misspeak or misspell things from time to time, the model output will have an error rate.
People are really misunderstanding things here. The one model that can actually play at lichess 1800 Elo does not need any of those and will play thousands of moves before a single illegal one. But he isn't just testing that one specific model. He is testing several models, some of which cannot reliably output legal moves (and as such, this logic is required)
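For anyone wondering what "this logic" might look like: a minimal sketch of a resample-until-legal wrapper. I haven't seen the author's code, so treat this as a guess at the general shape. `propose_move` and `is_legal` are hypothetical stand-ins for the model call and a rules check (e.g. from a chess library), and the "model" below is just a random picker.

```python
import random
from typing import Callable, Optional

def pick_legal_move(
    propose_move: Callable[[], str],
    is_legal: Callable[[str], bool],
    max_tries: int = 10,
) -> Optional[str]:
    """Ask the model for moves until one is legal; give up after max_tries."""
    for _ in range(max_tries):
        move = propose_move()
        if is_legal(move):
            return move
    return None  # caller decides what to do: forfeit, random legal move, etc.

# Toy stand-ins: a "model" that sometimes proposes a move that isn't legal here.
legal_moves = {"e4", "d4", "Nf3"}
rng = random.Random(42)

def fake_model() -> str:
    return rng.choice(["e4", "d4", "Nf3", "Ke9"])

print(pick_legal_move(fake_model, legal_moves.__contains__))
```

A model that almost never proposes illegal moves pays essentially nothing for this wrapper; a model that often does gets its failures papered over, which is why the number of retries is worth reporting alongside the results.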
The bizarre intellectual quadrilles people dance to sustain their denial of LLM capabilities will never cease to amaze me.
Yet when an LLM, trained on a corpus of human stupidity, no less, makes illegal moves in chess, our brain immediately goes: I don't make illegal moves in chess, how can a computer play chess if it does?
Perfect examples of metacognitive bias and general attribution error at least.
"Look! It made mistakes, therefore it's definitely not reasoning!"
That's certainly not what I'm saying, anyway. I was responding to the argument actually being made by many here, which is:
"Look! It plays pretty poorly, but not totally crap, and it wasn't trained for playing just-above-poor chess, therefore, it understands chess and definitely is reasoning!"
I find this - and much of the surrounding discussion - to be quite an amazing display of people's biases, myself. People want to believe LLMs are reasoning, and so we're treated to these merry-go-round "investigations".
But more than being able to make moves, if we claim it understands chess, shouldn't it be able to explain why it chose one move over another?
You can divide reasoning into three levels:
1) Can't reason - just regurgitates from memory
2) Can reason, but makes mistakes
3) Always reasons perfectly, never makes mistakes
If an LLM makes mistakes, you've proven that it doesn't reason perfectly.
You haven't proven that it can't reason.
But: Plenty of people struggle with playing along with Socrates' method. Can they not reason at all?
I hope you can translate the conversation between a philosopher and chatgpt from Russian [1]. The philosopher builds the conversation according to Socrates' method, but chatgpt can not even react consistently.
> Plenty of people struggle with playing along with Socrates' method. Can they not reason at all?
I do not hold the opinion that chatgpt "struggles" with Socrates' method; I am clearly seeing that it can not use it at all, even from the answering side of Socrates' dialogue, which is not that hard. Chatgpt can not use Socrates' method from the questioning side of the dialogue by design, because it never asks questions.
[1] https://hvylya.net/analytics/268340-dialogi-sergeya-dacyuka-...
Without a board to look at, just with the same linear text input given in the prompt? I bet a lot of amateurs would not give you legal moves. No drawing or side piece of paper allowed.
Or alternatively - if chat tuning diminishes some of the models' capability, would it make sense to have a smaller chat model prompt a large base model, and convert back the outputs?
The big breakthrough of GPT was exactly that. You can train a model with (for what that time was) a stupidly high amount of data and make it okay-ish at a lot of tasks you haven't trained it on explicitly.
With newer techniques, such as chain of thought and self-checking, you can also generate a ton of high-quality training data, that won't degrade the output of the LLM. Though the degree to which you can do that is not clear to me.
Imo it makes sense to train an LLM as a chatbot from the start.
Well, not everyone. I wasn't the only one to mention this, so I'm surprised it didn't show up in the list of theories, but here's e.g. me, seven days ago (source https://news.ycombinator.com/item?id=42145710):
> At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training.
This is not the same thing as cheating/replacing the LLM output, the theory that's mentioned and debunked in the article. And now the follow-up adds weight to this guess:
> Here’s my best guess for what is happening: ... OpenAI trains its base models on datasets with more/better chess games than those used by open models. ... Meanwhile, in section A.2 of this paper (h/t Gwern) some OpenAI authors mention that GPT-4 was trained on chess games in PGN notation, filtered to only include players with Elo at least 1800.
To me, it makes complete sense that OpenAI would "spike" their training data with data for tasks that people might actually try. There's nothing unethical about this. No dataset is ever truly "neutral", you make choices either way, so why not go out of your way to train the model on potentially useful answers?
OpenAI just shifted their training targets, initially they thought Chess was cool, maybe tomorrow they think Go is cool, or maybe the ability to write poetry. Who knows.
But it seems like the simplest explanation and makes the most sense.
Maybe that'll be enough moat to save us from AGI.
'Chess' is not a standard LLM benchmark worth Goodharting; OA has generally tried to solve problems the right way rather than by shortcuts & cheating, and the GPTs have not heavily overfit on the standard benchmarks or counterexamples that they so easily could which would be so much more valuable PR (imagine how trivial it would be to train on, say, 'the strawberry problem'?), whereas some other LLM providers do see their scores drop much more in anti-memorization papers; they have a clear research use of their own in that very paper mentioning the dataset; and there is some interest in chess as a model organism of supervision and world-modeling in LLMs because we have access to oracles (and it's less boring than many things you could analyze), which explains why they would be doing some research (if not a whole lot). Like the bullet chess LLM paper from Deepmind - they aren't doing that as part of a cunning plan to make Gemini cheat on chess skills and help GCP marketing!
I would guess GPT-4o isn't first pre-trained and then instruct-tuned, but trained directly with refined instruction-following material.
This material probably contains way fewer chess games.
Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.
If you make an illegal move and the opponent doesn't notice it, you gain a significant advantage. LLMs just have David Sirlin's "Playing to Win" as part of their training data.
1. having a human expert creating every answer
or
2. having an expert check 10 answers each of which have a 90% chance of being right and then manually redoing the one which was wrong
Now add the complications that:
• option 1 also isn't 100% correct
• nobody knows which things in option 2 are correlated or not and if those are or aren't correlated with human errors so we might be systematically unable to even recognise the errors
• even if we could, humans not only get lazy without practice but also get bored if the work is too easy, so a short-term study in efficiency changes doesn't tell you things like "after 2 years you get mass resignations by the competent doctors, while the incompetent just say 'LGTM' to all the AI answers"
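A back-of-the-envelope version of the two options, ignoring all the complications in the bullets. The cost numbers are pure assumptions for illustration (writing an answer costs 1 unit, checking one costs some fraction of that), and it treats the 10 answers as independent, which is exactly what the second bullet says we don't know.

```python
def cost_create_all(n_answers: int, create_cost: float = 1.0) -> float:
    """Option 1: a human expert writes every answer."""
    return n_answers * create_cost

def expected_cost_check_and_fix(
    n_answers: int,
    p_correct: float,
    check_cost: float,
    create_cost: float = 1.0,
) -> float:
    """Option 2: check every AI answer, then redo the ones that were wrong."""
    expected_wrong = n_answers * (1.0 - p_correct)
    return n_answers * check_cost + expected_wrong * create_cost

n, p = 10, 0.9
print("P(all 10 right):", round(p ** n, 4))
for check_cost in (0.1, 0.5, 0.9):
    print(check_cost, cost_create_all(n), expected_cost_check_and_fix(n, p, check_cost))
```

Under these assumptions option 2 only stops paying off once checking an answer costs about 90% as much as writing it from scratch. The bullets above are all reasons the real comparison is messier than that.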
It's kind of like humans.
But this math analogy is not quite appropriate: there's abstract math and arithmetic. A good math practitioner (LLM or human) can be bad at arithmetic, yet good at abstract reasoning. The latter doesn't (necessarily) require the former.
In chess, I don't think that you can build a good strategy if it relies on illegal moves, because tactics and strategies are tied.
Applying a corrective script to weed out bad answers is also not "one-shot" solving anything, so I would call your example an elaborate guessing machine. That doesn't mean it's not useful, but that's not how a human being does maths, when they understand what they're doing - in fact you can readily program a computer to solve general maths problems correctly the first time. This is also exactly the problem with saying that LLMs can write software - a series of elaborate guesses is undeniably useful and impressive, but without a corrective guiding hand, ultimately useless, and not demonstrating generalised understanding of the problem space. The dream of AI is surely that the corrective hand is unnecessary?
I was once asked by one of the Clueless Admin types if we couldn't just "fix" various sites such that people couldn't input anything wrong. Same principle.
An LLM that recognizes an input as "math" and calls out to a NON-LLM to solve the problem vs an LLM that recognizes an input as "math" and also uses next-token prediction to produce an accurate response ARE DIFFERENT.
Computationally it's trivial to detect illegal moves, so it's nothing like filtering out incorrect medical advice.
Hardware that accurately performs maths faster than all of humanity combined is so cheap as to be disposable, but I've yet to see anyone claim that a Pi Zero has "understanding" of anything.
An LLM can display the viva voce approach that Turing suggested[0], and do it well. Ironically for all those now talking about "stochastic parrots", the passage reads:
"""… The game (with the player B omitted) is frequently used in practice under the name of viva voce to discover whether some one really understands something or has ‘learnt it parrot fashion’. …"
Showing that not much has changed on the philosophy of this topic since it was invented.
[0] https://academic.oup.com/mind/article/LIX/236/433/986238
I'll have a stab at it. The idea of LLMs 'understanding' maths is that, once having been trained on a set of maths-related material, the LLM will be able to generalise to solve other maths problems that it hasn't encountered before. If an LLM sees all the multiplication tables up to 10x10, and then is correctly able to compute 23x31, we might surmise that it 'understands' multiplication - i.e. that it has built some generalised internal representation of what multiplication is, rather than just memorising all possible answers. Obviously we don't expect generalisation from a Pi Zero without specifically being coded for it, because it's a fixed function piece of hardware.
Personally I think this is highly unlikely given that maths and natural language are very different things, and being good at the latter does not bear any relationship to being good at the former (just ask anyone who struggles with maths - plenty of people do!). Not to mention that it's also much easier to test for understanding of maths because there is (usually!) a single correct answer regardless of how convoluted the problem - compared to natural language where imitation and understanding are much more difficult to tell apart.
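A toy way to picture the 10x10 table vs 23x31 distinction above (not a claim about what an LLM does internally): a pure lookup table built from the times table fails on 23x31, while a schoolbook long-multiplication routine that only ever consults that same table for single-digit products handles it. The function names are made up for the illustration.

```python
# The "training data": every product from 0x0 up to 10x10.
TIMES_TABLE = {(a, b): a * b for a in range(11) for b in range(11)}

def multiply_by_lookup(a: int, b: int) -> int:
    """Pure memorisation: works only for pairs that are in the table."""
    return TIMES_TABLE[(a, b)]  # raises KeyError outside the table

def multiply_by_algorithm(a: int, b: int) -> int:
    """Schoolbook long multiplication for non-negative integers.

    Uses the table only for single-digit products, plus place-value shifts.
    """
    digits_a = [int(d) for d in reversed(str(a))]
    digits_b = [int(d) for d in reversed(str(b))]
    total = 0
    for i, da in enumerate(digits_a):
        for j, db in enumerate(digits_b):
            total += TIMES_TABLE[(da, db)] * 10 ** (i + j)
    return total

print(multiply_by_algorithm(23, 31))  # 713
try:
    multiply_by_lookup(23, 31)
except KeyError:
    print("lookup table has no entry for 23 x 31")
```

Both functions agree on everything inside the table, so you can only tell them apart by testing outside it, which is the whole argument about generalisation in miniature.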
Obviously not, but that is tangential to this discussion, I think. A hammer might be a useful tool in certain situations, and surely it does not replace a human (but it might make a human in those situations more productive, compared to a human without a hammer).
> generating new ideas
Is brainstorming not an instance of generating new ideas? I would strongly argue so. And whether the LLM does "understand" (or whatever ill-defined, ill-measurable concept one wants to use here) anything about the ideas it produces, and how they might be novel - that is not important either.
If we assume that Tao is adequately assessing the situation and truthfully reporting his findings, then LLMs can, at the current state, at least occasionally be useful in generating new ideas, at least in mathematics.
You're strictly correct, but the rules for chess are infamously hard to implement (as anyone who's tried to write a chess program will know), leading to minor bugs in a lot of chess programs.
For example, there's this old myth about vertical castling being allowed due to ambiguity in the ruleset: https://www.futilitycloset.com/2009/12/11/outside-the-box/ (Probably not historically accurate).
If you move beyond legal positions into who wins when one side flags, the rules state that the other side should be awarded a victory if checkmate was possible with any legal sequence of moves. This is so hard to check that no chess program tries to implement it, instead using simpler rules to achieve a very similar but slightly more conservative result.
Come on. Yeah they're not trivial but they've been done numerous times. There's been chess programs for almost as long as there have been computers. Checking legal moves is a _solved problem_.
Detecting valid medical advice is not. The two are not even remotely comparable.
Uh? Where exactly did I signal my support for LLMs giving medical advice?
(thinking about which rule set is correct would not be meaningful in my opinion - chess is a social construct, with only parts of it being well defined. I would not bother about the rest, at least not when implementing it)
By the way: I read "Computationally it's trivial" as more along the lines of "it has been done before, it is efficient to compute, one just has to do it" versus "this is new territory, one needs to come up with how to wire up the LLM output with an SMT solver, and we do not even know if/how it will work."
Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence required to play at ELO 1800.
(Another angle: a human on Chess.com also has any illegal move they try to make ignored, too.)
That’s exactly what it does. 1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess.
People in this thread are really distracted by the medical analogy so I’ll offer another: you’ve got a bridge that allows millions of vehicles to cross, and randomly falls down if you tickle it wrong, maybe a car of rare color. One key aspect of bridges is that they work reliably for any vehicle, and once they fail they don’t work with any vehicle. A bridge that sometimes fails and sometimes doesn’t isn’t a bridge as much as a death trap.
Highly rated chess players make illegal moves. It's rare but it happens. They don't understand chess ?
Humans with correct models may nevertheless make errors in rule applications. Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.
Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors. In a math or physics class this is roughly the difference between carry-the-one arithmetic errors vs using an equation from a completely wrong domain. The word “understands” is loaded in discussion of LLMs, but everyone knows which mistake is going to get partial credit vs zero credit on an exam.
>Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect or incomplete models.
I don't know why people continue to force the wrong abstraction. LLMs do not work like 'machines'. They don't 'follow rules' the way we understand normal machines to 'follow rules'.
>so when they fail to apply rules correctly, it means they have incorrect or incomplete models.
Everyone has incomplete or incorrect models. It doesn't mean we always say they don't understand. Nobody says Newton didn't understand gravity.
>Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors.
It's not very apparent no. You've just decided it has different causes because of preconceived notions on how you think all machines must operate in all configurations.
LLMs are not the logic automatons in science fiction. They don't behave or act like normal machines in any way. The internals run some computations to make predictions but so does your nervous system. Computation is substrate-independent.
I don't even know how you can make this distinction without seeing what sort of illegal moves it makes. If it makes the sort that highly rated players make, then what ?
- Generally, we do not say someone does not understand just because of a model error. The model error has to be sufficiently large or the model sufficiently narrow. No-one says Newton didn't understand gravity just because his model has an error in it but we might say he didn't understand some aspects of it.
- You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.
> You are saying the LLM is making a model error (rather than an an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.
Here's an anecdotal examination. After much talk about LLMs and chess, and math, and formal logic here's the state of the art, simplified from dialog with gpt today:
> blue is red and red is blue. what color is the sky? >> <blah blah, restates premise, correctly answer "red">
At this point fans rejoice, saying it understands hypotheticals and logic. Dialogue continues..
> name one red thing >> <blah blah, restates premise, incorrectly offers "strawberries are red">
At this point detractors rejoice, declare that it doesn't understand. Now the conversation devolves into semantics or technicalities about prompt-hacks, training data, weights. Whatever. We don't need chess. Just look at it, it's broken as hell. Discussing whether the error is human-equivalent isn't the point either. It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.
Are humans broken ? Because our reasoning is a very broken process. You say it's no solid foundation ? Take a look around you. This broken processor is the foundation of society and the conveniences you take for granted.
The vast vast majority of human history, there wasn't anything even remotely resembling a non-broken general reasoner. And you know the funny thing ? There still isn't. When people like you say LLMs don't reason, they hold them to a standard that doesn't exist. Where is this non-broken general reasoner in anywhere but fiction and your own imagination?
>And while there are some exceptions, an unreliable tool/agent is often worse than none at all.
Since you clearly take reliable to mean 'makes no mistakes/is not broken', no human is a reliable agent. Clearly, the real exception is when an unreliable agent is worse than nothing at all.
That's assuming that, somehow, a LLM is a machine. Why would you think that?
I think we are discussing whether LLMs can emulate chess playing machines, regardless of whether they are actually literally composed of a flock of stochastic parrots..
> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.
If this line of reasoning applies to machines, but LLMs aren't machines, how can you derive any of these claims?
"A implies B" may be right, but you must first demonstrate A before reaching conclusion B..
> I think we are discussing whether LLMs can emulate chess playing machines
That is incorrect. We're discussing whether LLMs can play chess. Unless you think that human players also emulate chess playing machines?
And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.
The point is that it plays at a level far far above making random legal moves or even average humans. To say that that doesn't mean anything because it's not perfect is simply insane.
But it actually is safety critical very quickly whenever you say something like “works fine most of the time, so our plan going forward is to dismiss any discussion of when it breaks and why”.
A bridge failure feels like the right order of magnitude for the error rate and effective misery that AI has already quietly caused with biased models where one in a million resumes or loan applications is thrown out. And a nuclear bomb would actually kill fewer people than a full on economic meltdown. But I’m sure no one is using LLMs in finance at all, right?
It’s so arrogant and naive to ignore failure modes that we don’t even understand yet.. at least bridges and steel have specs. Software “engineering” was always a very suspect name for the discipline but whatever claim we had to it is worse than ever.
The internal model of a LLM is statistical text. Which is linear and fixed. Not great other than generating text similar to what was ingested.
Not at all. Like seriously, not in the slightest.
Their representation of the input is also not linear. Transformers use self-attention which relies on the softmax function, which is non-linear.
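To make the non-linearity point concrete, here is a minimal sketch in plain Python (no ML library; the vectors are made up for illustration). A linear map f would satisfy f(a + b) = f(a) + f(b). Softmax cannot, because its outputs always sum to 1, so the sum of two softmax outputs sums to 2.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary example vectors (assumed values, not from any model).
a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 2.0]

# Left side: softmax applied to the sum of the inputs.
lhs = softmax([x + y for x, y in zip(a, b)])
# Right side: sum of the softmax outputs, which is what linearity would require.
rhs = [x + y for x, y in zip(softmax(a), softmax(b))]

# lhs sums to 1 and rhs sums to 2, so the two can never be equal.
is_additive = all(math.isclose(l, r) for l, r in zip(lhs, rhs))
print(is_additive)  # False
```

This only shows that one building block is non-linear; it says nothing by itself about what a full transformer can or cannot compute.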
The internal model of a CPU is linear and fixed. Yet, a CPU can still generate an output which is very different from the input. It is not a simple lookup table, instead it executes complex algorithms.
An LLM has large amounts of input processing power. It has a large internal state. It executes "cycle by cycle", processing the inputs and internal state to generate output data and a new internal state.
So why shouldn't LLMs be capable of executing complex algorithms?
The issue is always input consumption, and output correctness. In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the outputs are correct. So the issue is that the internal model (for a given task) of LLMs is not consistent enough and the referential window (keeping track of each item in the system) is always too small.
> In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the output are correct.
Sure, intelligent design makes for a better design in many ways.
That doesn't mean that an evolved design doesn't work at all, right?
Not really, you can try to make illegal moves in chess, and usually, you are given a time penalty and get to try again, so even in a real chess game, illegal moves are "filtered out".
And for the "medical expert" analogy, let's say that you compare two systems based on the well-being of the patients after they follow the advice. I think it is meaningful even if you filter out advice that is obviously inapplicable, for example because it refers to non-existing body parts.
It's beginner chess and beginners make moves at random all the time.
And the article does show various graphs of the badly playing models which will hardly play worse than random but are clearly far below the good models.
I have NO idea why no one seems to do this. It's a similar issue with LLM-as-judge evaluations. Often they are begging to be combined with grammar based/constrained/structured sampling. So much good stuff in LLM land isn't used for no good reason! There are several libraries for implementing this easily, outlines, guidance, lm-format-enforcer, and likely many more. You can even do it now with OpenAI!
Oobabooga text gen webUI literally has chess as one of its candidate examples of grammar based sampling!!!
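For anyone who hasn't seen constrained sampling, the core idea fits in a few lines. This is a library-free sketch, not the API of outlines, guidance or lm-format-enforcer: it assumes you already have a score per candidate move (the `logits` dict and the move lists below are made up) and a set of legal moves from some rules engine. Real grammar-based sampling applies the same mask token by token rather than per whole move.

```python
import math
import random

def sample_legal_move(logits, legal_moves, rng=random):
    """Sample from softmax(logits) restricted to legal_moves.

    logits: dict of move string -> raw model score (hypothetical input).
    legal_moves: set of moves that a rules engine reports as legal.
    """
    # Mask: drop every candidate the rules engine does not allow.
    allowed = {m: s for m, s in logits.items() if m in legal_moves}
    if not allowed:
        raise ValueError("model assigned no score to any legal move")
    # Renormalize over the survivors (subtract the max for stability).
    top = max(allowed.values())
    moves = list(allowed)
    weights = [math.exp(allowed[m] - top) for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]

# Made-up scores: the two highest-scoring moves are not legal here.
logits = {"Ke9": 5.0, "Qh7": 4.0, "Nf3": 2.0, "e4": 1.5}
legal = {"Nf3", "e4", "d4"}
move = sample_legal_move(logits, legal, random.Random(0))
print(move in legal)  # True
```

The illegal candidates are never sampled, no matter how much probability the model put on them, which is the whole point of doing the masking before sampling instead of retrying afterwards.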
If you're arguing that the singularity already happened then your criticism makes perfect sense; these are dumb machines, not useful yet for most applications. If you just want to use the LLM as a tool though, the behavior when you filter out illegal responses (assuming you're able to do so) is the only reasonable metric.
Analogizing to a task I care a bit about: Current-gen LLMs are somewhere between piss-poor and moderate at generating recipes. With a bit of prompt engineering most recipes pass my "bar", but they're still often lacking in one or more important characteristics. If you do nothing other than ask it to generate many options and then as a person manually filter to the subset of ideas (around 1/20) which look stellar, it's both very effective at generating good recipes, and they're usually much better than my other sources of stellar recipes (obviously not generally applicable because you have to be able to tell bad recipes from good at a glance for that workflow to make sense). The fact that most of the responses are garbage doesn't really matter; it's still an improvement to how I cook.
Sharp eyes. Similarly, Andrew Ng and his Stanford University team pulled the same trick, using an overfitting-prone training-to-testing ratio in their famous cardiologist-level paper published in Nature Medicine [1].
The training share is more than 99% and the testing share less than 1%, which fails AI validation 101. The paper would not stand at most AI conferences, but it was published in Nature Medicine, one of the highest-impact-factor journals there is, and it has many citations for AI in healthcare and medicine.
[1] Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network:
Regardless of the actual experiment outcome, I think this is a super valuable insight. The "Should we provide legal moves?" section is an excellent case study of this: an extremely prudent idea actually degrades model performance, and quite badly. It's like that crocodile game where you're pushing teeth until it clamps onto your hand.
So it makes sense that an LLM would also be able to acquire some skill by simply having a large volume of chess games in its training data.
OpenAI probably just eventually decided it wasn't useful to keep pursuing chess skill.
It's still interesting to try to replicate how you would make a generalist LLM good at chess, so i appreciated the post, but I don't think there's a huge mystery!
I can't understand why no research group is going hard at this.
Having said all of that, it wouldn't surprise me if the "language to world model" thesis you reference is indeed wrong. But I don't think a model that plays chess well disproves it, particularly since there are chess engines using old fashioned approaches that utterly destroy LLMs.
Understanding is a funny concept to try to apply to computer programs anyway. But playing from an illegal state seems (to me at least) to indicate something interesting about the ability to comprehend the general idea of chess.
I haven’t tried this yet, but I think you can set up and analyze board positions that are legal but that could never be reached in a real game (e.g., having nine pawns of one color).
Not that I think there's anything inherently unreasonable about an LLM understanding chess, but I think the author missed a variant hypothesis here:
What if that specific model, when it recognizes chess notation, is trained to silently "tag out" for another, more specialized LLM, that is specifically trained on a majority-chess dataset? (Or — perhaps even more likely — the model is trained to recognize the need to activate a chess-playing LoRA adapter?)
It would still be an LLM, so things like "changing how you prompt it changes how it plays" would still make sense. Yet it would be one that has spent a lot more time modelling chess than other things, and never ran into anything that distracted it enough to catastrophically forget how chess works (i.e. to reallocate some of the latent-space vocabulary on certain layers from modelling chess, to things that matter more to the training function.)
And I could certainly see "playing chess" as a good proving ground for testing the ability of OpenAI's backend to recognize the need to "loop in" a LoRA in the inference of a response. It's something LLM base models suck at; but it's also something you intuitively could train an LLM to do (to at least a proficient-ish level, as seen here) if you had a model focus on just learning that.
Thus, "ability of our [framework-mediated] model to play chess" is easy to keep an eye on, long-term, as a proxy metric for "how well our LoRA-activation system is working", without needing to worry that your next generation of base models might suddenly invalidate the metric by getting good at playing chess without any "help." (At least not any time soon.)
> What if that specific model, when it recognizes chess notation, is trained to silently "tag out" for another, more specialized LLM, that is specifically trained on a majority-chess dataset? (Or — perhaps even more likely — the model is trained to recognize the need to activate a chess-playing LoRA adapter?)
Pretty sure your variant hypothesis is sufficiently covered by the author's writing.
So strange that people are so attached to conspiracy theories in this instance. Why would OpenAI or anyone go through all the trouble? The proposals outlined in the article make far more sense and track well with established research (namely that applying RLHF to a "text-only" model tends to wreak havoc on said model).
The author was talking about "calling out to a chess engine" — and then explained why that'd be fragile and why everyone in the company would know if they did something like that.
The variant hypothesis is just that gpt-3.5-turbo was a testbed for automatic domain-specific LoRA adaptation, which wouldn't be fragile and which could be something that everyone in the company could know they did, without having any knowledge of one of their "test LoRAs" for that R&D project happening to involve a chess training data corpus.
The actual argument against the variant hypothesis is something like "why wouldn't we see automatic LoRA adaptation support in newer GPTs, then?" — but that isn't an argument the author makes.
Why conflate the parameters of chess with checkers and go if you already have high quality models for each? I thought tool use and RAG were fair game.
There are several fatal flaws.
The first problem is that he isn't clearly and concisely displaying the current board state. He is expecting the model to attend to a move sequence to figure out the board state.
Secondly, he isn't allowing the model to think elastically using COT or other strategies.
Honestly, I am shocked it is working at all. He has basically formulated the problem in the worst possible way.
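On the first point, showing the board state explicitly is cheap to do outside the model. Below is a rough sketch of turning a move list into a text diagram that could be pasted into a prompt. Big assumptions: moves are given in coordinate form like "e2e4" (the article uses standard notation, which needs a real move parser), and castling, en passant, promotion and legality checks are all left out. The function names are made up for this example.

```python
def start_board():
    """Initial position as a dict, e.g. {'e2': 'P', 'e7': 'p'}."""
    board = {}
    back_rank = "RNBQKBNR"
    for i, file in enumerate("abcdefgh"):
        board[f"{file}1"] = back_rank[i]          # white pieces
        board[f"{file}2"] = "P"                   # white pawns
        board[f"{file}7"] = "p"                   # black pawns
        board[f"{file}8"] = back_rank[i].lower()  # black pieces
    return board

def apply_moves(board, moves):
    """Apply coordinate moves such as 'e2e4' by relocating the piece.

    Simplification: no castling, en passant, promotion or legality check.
    """
    for mv in moves:
        src, dst = mv[:2], mv[2:4]
        board[dst] = board.pop(src)  # a capture simply overwrites dst
    return board

def render(board):
    """One line per rank, rank 8 first, '.' for empty squares."""
    rows = []
    for rank in "87654321":
        rows.append("".join(board.get(f"{f}{rank}", ".") for f in "abcdefgh"))
    return "\n".join(rows)

board = apply_moves(start_board(), ["e2e4", "e7e5", "g1f3"])
diagram = render(board)
print(diagram)
```

Whether feeding a diagram like this actually helps the model is an empirical question; the sketch only shows that producing it is not the hard part.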
It may just be coming up with "good moves" that a "grandmaster" would make, but after all, grandmasters still lose 49% of their games, all the while making "good moves"
I would suppose that the LLM is actually wholly uninterested in victory given this prompting.
It's exactly like the strawberry problem. No LLM can count the letters in a word. But when shown that, they explicitly were taught to recognize the prompt and count the letters in the word. They didn't create a letter counting algorithm, but they did build a table of words and letter counts. And every "new" LLM explicitly looks for a phrase that looks like "how many Rs are in strawberry" and then the LLM looks in the "letters in words" neural network instead of the "what is the next likely word in this sentence net".
All "new" LLMs (in the next few weeks) will suddenly become decent at chess because they will have weighted preference to look at the "chess moves" neural net instead of the "random numbers and letters sequence" neural net when they detect a sentence that looks like "d4, d5; nd3, ?" etc.
I'm still amazed that this works at all, since it lacks actual board state.
This alone should put to rest all the arguments that LLMs lack a world model, that they're just manipulating words, that they're just spitting probabilistic answers, and so on.
I have been shopping for things for a bigger DIY project for a few months now, and recently it started to hallucinate products with the specifications I need.
In fact it returns mainly broken code, hallucinates functions that never existed (zero Google results) and so on.
Not sure if I am using it more, or it just got much more useless.