LLMs, Theory of Mind, and Cheryl's Birthday
77 points by stereoabuse 5 hours ago | 31 comments
  • erwald 3 hours ago |
    o1 mini seems to get it on the first try (I didn't vet the code, but I tested it and it works on both examples provided in the notebook, `dates` and `gabe_dates`):

        from collections import defaultdict
        
        def find_cheryls_birthday(possible_dates):
            # Parse the dates into month and day
            dates = [date.split() for date in possible_dates]
            months = [month for month, day in dates]
            days = [day for month, day in dates]
        
            # Step 1: Albert knows the month and says he doesn't know the birthday
            # and that Bernard doesn't know either. This implies the month has no unique days.
            month_counts = defaultdict(int)
            day_counts = defaultdict(int)
            for month, day in dates:
                month_counts[month] += 1
                day_counts[day] += 1
        
            # Months with all days appearing more than once
            possible_months = [month for month in month_counts if all(day_counts[day] > 1 for m, day in dates if m == month)]
            filtered_dates = [date for date in dates if date[0] in possible_months]
        
            # Step 2: Bernard knows the day and now knows the birthday
            # This means the day is unique in the filtered dates
            filtered_days = defaultdict(int)
            for month, day in filtered_dates:
                filtered_days[day] += 1
            possible_days = [day for day in filtered_days if filtered_days[day] == 1]
            filtered_dates = [date for date in filtered_dates if date[1] in possible_days]
        
            # Step 3: Albert now knows the birthday, so the month must be unique in remaining dates
            possible_months = defaultdict(int)
            for month, day in filtered_dates:
                possible_months[month] += 1
            final_dates = [date for date in filtered_dates if possible_months[date[0]] == 1]
        
            # Convert back to original format
            return ' '.join(final_dates[0]) if final_dates else "No unique solution found."
        
        # Example usage:
        possible_dates = [
            "May 15", "May 16", "May 19",
            "June 17", "June 18",
            "July 14", "July 16",
            "August 14", "August 15", "August 17"
        ]
        
        birthday = find_cheryls_birthday(possible_dates)
        print(f"Cheryl's Birthday is on {birthday}.")
    • mewpmewp2 2 hours ago |
      In addition to that, after the models produce a first program with mistakes, the author should have shown them the invalid output and given them a chance to fix it. For humans, too, solving this on the first try without running the code frequently doesn't work.
    • fragmede 2 hours ago |
      "seems to" isn't good enough, especially since it's entirely possible to generate code that doesn't give the right answer. 4o is able to write some bad code, run it, recognize that it's bad, and then fix it, if you tell it to.

      https://chatgpt.com/share/670086ed-67bc-8009-b96c-39e539791f...

  • joe_the_user 3 hours ago |
    Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.

    It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate, so that a small increase in problem complexity requires a significant increase in LLM power.

    • godelski 2 hours ago |

        > Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.
      
      That's orthogonal.

      If we are pointing in the right direction(s) then yes, next iteration could resolve all problems.

      If we are not pointing in the right direction(s) then no, next iteration will not resolve these problems.

      Given LLMs' rapid improvement in regurgitating knowledge from their training data but simultaneously slow improvement in their ability to generalize (such as on logic "puzzles"), I think it is naive to assume we're pointed in the right direction. Maybe we're even pointing in mostly the right direction. But why assume we are?

      We can continue in the direction we are going while simultaneously considering that it might not be well aligned. If we are well aligned, that gives us more confidence and makes gathering funding easier. If we aren't, well, it is easier to course-correct sooner rather than later. In either case, you benefit from the analysis.

      Understanding why things fail is more important than understanding why things succeed.

      • Uehreka 38 minutes ago |
        GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.

        Thus observers of the LLM space like us need to keep finding novel “bellwether problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about it openly the problem will no longer be a useful bellwether.

        By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.

        • godelski 15 minutes ago |
          Oh, thanks for the correction. I did misinterpret.

          Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about, but they still aren't becoming better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction on reasoning (as opposed to a similar direction). But the strongest evidence to me is that they're entropy minimizers.

          What's extra interesting is that transformers CRAVE augmentations. I work in vision, and this is necessary to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.

    • aithrowawaycomm 17 minutes ago |
      > It seems like the only way you could systematic chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate

      That would be any problem more complicated than O(n) complexity, even with chain-of-thought prompting[1].

      Note that the O(n) thing can bite you in all sorts of unintuitive ways: if the LLM+CoT can perform an O(n) Task A and O(m) Task B, then it can't do the O(nm) task "for every step of A, perform B on the result" unless you come up with a task-specific prompt outlining the solution. The alternative is to play RLHF Whack-A-Mole, separately training the LLM on the combined task. (I think this weakness might be why LLMs are hitting a wall in enterprise deployment, and also explains why LLM agents don't actually work.) The only way this will get fixed is with a fundamentally more sophisticated architecture.

      [1] https://www.quantamagazine.org/how-chain-of-thought-reasonin...

  • jfcoa 2 hours ago |
    This seems like a terrible test case since python examples are readily available in the training data: https://rosettacode.org/wiki/Cheryl%27s_birthday

    It's interesting that so many of the models fail to retrieve this, but any that do solve it should clearly be able to do so with no reasoning/theory of mind.

  • whack 2 hours ago |
    > At least with respect to this problem, they had no theory of mind.

    This is very interesting and insightful, but I take issue with the above conclusion. Your average software engineer would probably fail to code up a python solution to this problem. But most people would agree that the average software engineer, and the average person, possesses some theory of mind.

    This seems to be a pattern I'm noticing with AI. The goalposts keep moving. When I was a kid, the Turing test was the holy grail for "artificial intelligence." Now, your run-of-the-mill LLM can breeze through the Turing test. But no one seems to care. "They are just imitating us, that doesn't count." Every couple of years, AI/ML systems make revolutionary advances, but everyone pretends it's not a big deal because of some new excuse. The latest one being "LLMs can't write a python program to solve an entire class of very challenging logic problems. Therefore LLMs possess no theory of mind."

    Let me stick my neck out and say something controversial. Are the latest LLMs as smart as Peter Norvig? No. Are they smarter than your average human? Yes. Can they outperform your average human at a randomly chosen cognitive task that has real-world applications? Yes. This is pretty darn revolutionary. We have crossed the Rubicon. We are watching history unfold in real time.

    • Jerrrrrrry 2 hours ago |
      The goalposts will continue to move until GDP improves.
    • titanomachy 2 hours ago |
      I consider myself a pretty average human programmer, and I was able to solve the logic puzzle and write a python program for it in ~10 mins. [0]

      I agree though, the people who are unable to solve this probably still have a theory of mind. It seems like we're setting a rather high bar.

      [0] https://pastebin.com/q33K0HJ1
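
      For anyone who doesn't want to open the pastebin, here's a minimal sketch of the same elimination idea written in "who was told what, who now knows" terms. To be clear, this is not the linked code; the `told`/`knows` helpers are illustrative names of my own:

          DATES = ["May 15", "May 16", "May 19",
                   "June 17", "June 18",
                   "July 14", "July 16",
                   "August 14", "August 15", "August 17"]

          def month(d): return d.split()[0]
          def day(d):   return d.split()[1]

          def told(part, value, candidates):
              # Dates still possible for someone who was told `value` for `part`.
              return [d for d in candidates if part(d) == value]

          def knows(part, value, candidates):
              # That person knows the birthday iff exactly one candidate remains.
              return len(told(part, value, candidates)) == 1

          def solve(dates):
              # 1. Albert (month) doesn't know, and knows Bernard (day) doesn't either:
              #    every date sharing Albert's month must have a non-unique day.
              s1 = [d for d in dates
                    if not knows(month, month(d), dates)
                    and all(not knows(day, day(e), dates)
                            for e in told(month, month(d), dates))]
              # 2. Bernard now knows: his day is unique within s1.
              s2 = [d for d in s1 if knows(day, day(d), s1)]
              # 3. Albert now knows: his month is unique within s2.
              return [d for d in s2 if knows(month, month(d), s2)]

          print(solve(DATES))  # ['July 16']

      Each statement in the dialogue becomes one filtering pass over the remaining dates.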

  • jawns 2 hours ago |
    A long time ago, I created a version of this challenge called "Cheryl's Murder."

    My notebook not only solves logical induction problems like "Cheryl's Birthday," but it also generates them.

    https://github.com/shaungallagher/cheryls-murder/blob/master...
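
    For readers curious how the generation half can work, one simple recipe (a sketch of the general idea, not the notebook's actual code, and the month/day pools below are arbitrary) is to sample random candidate date sets, run the usual three-statement elimination, and keep only the sets the Albert/Bernard dialogue pins down to exactly one date:

        import random

        MONTHS = ["May", "June", "July", "August"]  # arbitrary pool
        DAYS = [str(d) for d in range(14, 20)]      # arbitrary pool

        def unique_solution(dates):
            # Run the standard Albert/Bernard elimination; return the date if unique.
            def month(d): return d.split()[0]
            def day(d): return d.split()[1]
            def known(part, value, cands):
                return sum(part(d) == value for d in cands) == 1
            # Albert doesn't know, and knows Bernard doesn't either.
            s1 = [d for d in dates
                  if not known(month, month(d), dates)
                  and all(not known(day, day(e), dates)
                          for e in dates if month(e) == month(d))]
            # Bernard now knows.
            s2 = [d for d in s1 if known(day, day(d), s1)]
            # Albert now knows.
            s3 = [d for d in s2 if known(month, month(d), s2)]
            return s3[0] if len(s3) == 1 else None

        def generate_puzzle(n_dates=10, seed=None, max_tries=100_000):
            rng = random.Random(seed)
            pool = [f"{m} {d}" for m in MONTHS for d in DAYS]
            for _ in range(max_tries):
                candidate = sorted(rng.sample(pool, n_dates))
                answer = unique_solution(candidate)
                if answer:
                    return candidate, answer
            raise ValueError("no solvable date set found; try different pools")

        dates, answer = generate_puzzle(seed=0)
        print(dates, "->", answer)

    The same loop generalizes to longer dialogues: add more filtering rounds and keep only the candidate sets that still survive to a unique date.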

    • airstrike an hour ago |
      This is awesome, thanks for sharing
  • godelski 2 hours ago |
    I think the test is better than many other commenters are giving it credit for. It reminds me of responses to the river crossing problems. The reason people do tests like this is because we know the answer a priori or can determine it. Reasoning tests are about generalization, and this means you have to be able to generalize based on the logic.

    So the author knows that the question is spoiled, because they know that the model was trained on wiki. They also tested to see if the model is familiar with the problem in the first place. In fact, you too can confirm this by asking "What is the logic puzzle, Cheryl's birthday?" and it will spit out the correct answer.

    The problem also went viral, so there are even variations of this. That should tell us that the model has not just been trained on it, but that it has seen it in various forms and we know that this increases its ability to generalize and perform the task.

    So then we're left with reasoning. How do we understand reasoning? It is the logical steps. But we need to make sure that this is distinct from memorization. So throwing in twists (as people do in the river puzzles) is a way to distinguish memory from logic. That's where these models fail.

    People always complain that "oh, but humans can't do it." I refer to this as "proof by self-incompetence." (I also see it claimed when it isn't actually true) But not everybody reasons, and not all the time (trivial cases are when you're asleep or in a coma, but it also includes things like when you're hangry or just dumb). Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).

    There's also a key part that many people are missing in the analysis. The models were explicitly asked to *generalize* the problem.

    I'll also give some comments about letting them attempt to solve it iteratively, but this is often very tricky. I see this frequently with the river crossing puzzles, where information leaks back to the model. Asking a follow-up question like "are you sure?" is actually a hint: you typically don't ask that question when the answer is correct. (Though newer models, when sufficiently trained on a problem, won't always apologize as if they were wrong when they are actually correct.) You'll find that in these situations, if you run the same prompt multiple times (in new, clean sessions), the variance in the output is very low.

    Overall, a good way to catch LLMs out and differentiate reasoning from memorization is getting them to show their work, the steps in between. It isn't uncommon for them to get the right answer but have wrong steps, even in math problems. This is always a clear demonstration of memorization rather than reasoning. It is literally the subtlety that matters.

    I suspect that one of the difficulties humans have in analyzing LLMs is that there is no other entity capable of performing such feats that does not also have a theory of mind and a world model. But a good analogy might be facts that you know without understanding why they are "the answer." I'm sure there are many people who have memorized the complexities of many sorting algos or leetcode problems and couldn't derive the answer themselves.

    But I really don't understand why we *need* LLMs to reason. A dictionary memorizes things, and so does Wikipedia. Their inability to reason does not make them any less marvelous as inventions/tools. But maybe, if we're looking to create intelligent, thinking machines, it isn't as simple as scale. We love simple things, but few things are simple and correct (though far more are simple and approximately correct).

    • og_kalu an hour ago |
      >I think the test is better than many other commenters are giving credit.

      The test is fine. The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x. x may be important to your test for sure but you're testing for something else too. Or maybe humans don't have x after all. Either conclusion is logically consistent at least. It's the middle, "rules for thee but not me" conclusions that are tiring.

      Take theory of mind. If you want to see how well LLMs can track hidden motivations and knowledge and attribute them to different entities, then cook up your own bespoke (maybe even wacky) scenarios and see how they handle them over long contexts. That's how to test for theory of mind. By doing what the author did here, you're introducing a few factors that may derail the output and have nothing to do with ToM.

      >Humans are different from LLMs. LLMs are giving it 100%, every time.

      I don't know how anyone who uses LLMs extensively can genuinely believe this to be true. I mean, I'm not sure what this means? Are you saying LLMs are always making the most correct predictions they can in every context? Because that's just blatantly false.

      Yes, models overfit. Yes, you can trick them. No, it does not necessarily mean they haven't generalized well enough to solve your "subtle variation". And if people weren't so hellbent on being able to say "aha" to the machine, they would see that.

      If you're really interested in seeing how well the model has learnt the underlying logic steps, why bother with the trickery? Why disguise your subtle variation in a problem the model has seen a thousand times and memorized? You can have the same question requiring the same logic but written in a way that doesn't immediately point to an overfit problem (and you don't need to worry about whether hinting is 'cheating' or not). How is that not a better test of generalization?

      And I'm not saying that the tests with the trickery or subterfuge are useless or should be done away with, just that you are no longer testing only the ability to generalize.

      • godelski an hour ago |

          > The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x
        
        I think you misunderstand, but it's a common misunderstanding.

        Humans have the *ability* to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment).

        So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".

        It's correct to point out that not all humans can solve this puzzle. But that's an irrelevant fact, because the premise is not that humans always reason. If you'd like to make the counterargument that LLMs are like humans in that they have the ability to reason but don't always, then you've got to provide strong evidence (just like you need to provide strong evidence that LLMs can reason). But both claims are quite hard to prove, because humans aren't entropy minimizers trained on petabytes of text. It's easier to test humans because we generally have a much better idea of what they've been trained on, and we can also sample from different humans who have been trained on different types of data.

        And here's the real kicker: when you've found a human who can solve a problem (meaning not just state the answer but show their work), nearly all of them can easily adapt to novel augmentations.

        So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these. There's no sleight of hand. There are no magic tokens, no silly or staged wording that would be easily misinterpreted. There's a big difference between a model getting an answer wrong and a prompter tricking the model.

        • og_kalu 31 minutes ago |
          >I think you misunderstand, but it's a common misunderstanding. Humans have the ability to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment).

          >So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".

          This is all rather irrelevant here. You can sit a human down with this test for an arbitrarily long time and they will be unable to solve it, even if they have theory of mind (the property we're looking for) for the entire duration of the test; ergo, the test is not properly testing for the property of theory of mind.

          >So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these.

          Models are trained to predict text. Solving problems is often just a natural consequence of this objective.

          It's trickery the same way it can be considered trickery when professors do it to human test-takers. Humans and machines that memorize things take shortcuts in prediction when they encounter what they've memorized "in the wild". That's the entire point of memorization, really.

          The human or model might fail not because it lacks the reasoning abilities to solve your problem, but because its attention is diverted by misleading cues or subtle twists in phrasing.

          And if you care about the latter, fine! That's not a bad thing to care about, but then don't pretend you are only testing raw problem-solving ability.

    • Jerrrrrrry an hour ago |

        Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).
      
      Ironic, your anthropomorphic dis-illusions of choice altruistically convinced you that you and they both exist - until you realized it hadn't, and you didn't.

        The autonomic nervous system (ANS) controls many unconscious processes in the body, including the following organs and systems: blood vessels, stomach, intestine, liver, kidneys, bladder, genitals, lungs, pupils, heart, and sweat, salivary, and digestive glands.
      
      The ANS is a network of nerves that's part of the peripheral nervous system and is always active, even when you're asleep. It's essential for life - the war against entropy is ceaseless.
  • nextworddev 2 hours ago |
    The majority of flesh-and-blood humans can't solve the problem - so we need alternate measures for judging theory-of-mind capabilities in LLMs.
    • Terr_ an hour ago |
      What about the difference that the human knows what they don't know?

      In contrast, the LLM knows nothing, but confidently half-regurgitates correlational text that it has seen before.

      • og_kalu an hour ago |
        As far as all the research on this goes, LLMs (internally) mostly know what they know, but incentivizing that information to make it into the output is difficult.
    • godelski an hour ago |

        > The majority of humans in flesh can't solve the problem
      
      This is a non sequitur.

      All humans have the capacity to reason, but not all humans reason. There's a big difference. Hell, even GPT-4o will tell you this.

      Here, don't trust me, let's ask GPT about memorization and reasoning

        Prompt: 
        Suppose a LLM is able to get the right answer to a question, but when asked about the steps to arrive at the answer they are not able to provide an accurate solution. Does this tell us anything about a model's ability to reason or its ability to memorize?  What about when it gets the right answer but the explanation would lead to a different answer than the one stated?
      
        Response:
        These scenarios highlight important aspects of a model's capabilities and limitations regarding reasoning and memorization.  Let's break them down:
      
      
        1. **LLM Gets the Right Answer but Can't Explain the Steps:**
          - **Memorization vs. Reasoning:** If a model provides a correct answer without accurately explaining the steps, it might indicate that the model is relying more on memorization rather than understanding or reasoning. The model could have seen similar questions in its training data and learned to associate certain questions with specific answers without truly understanding the underlying principles.
          - **Reasoning Limitations:** This situation suggests that the model may not have a robust ability to reason through the problem-solving process. It highlights a gap between recognizing patterns in data and applying logical reasoning to derive answers.
      
        2. **LLM Provides Correct Answer but Incorrect Explanation:**
          - **Memorization of Answers:** This might suggest that the model has memorized the answer from similar examples in the training data but does not understand the reasoning process. It could be retrieving an answer that it "knows" is correct without understanding why.
          - **Inconsistent Reasoning Ability:** Giving a correct answer with an explanation that would lead to a different answer indicates inconsistencies in its reasoning ability. It may reflect that the model's internal heuristics for generating explanations are not aligned with the logic used to derive answers.
      
        In both cases, these issues highlight the challenges in AI related to understanding vs. pattern recognition. While LLMs are excellent at recognizing patterns and retrieving information, their ability to reason logically and consistently can be limited. This differentiation is crucial when evaluating the capabilities of AI models, particularly in contexts where understanding and accurate reasoning are essential.
    • aithrowawaycomm 41 minutes ago |
      This doesn't measure theory of mind at all; it's just a silly logic puzzle. What we need are AI researchers who have read a psychology book and understand what theory of mind experiments are actually trying to demonstrate.
  • extr 2 hours ago |
    This is an interesting problem, but it’s more of a logic problem than a true test of theory of mind. When I think “theory of mind” I think of being able to model an external agent with complete knowledge, incentives, and behavior. I would not doubt LLMs have something close to this for humans, almost by accident, since they are trained on human outputs.
  • m3kw9 an hour ago |
    Could be an architectural issue with the LLMs, because you need to juggle a lot of state from just one statement about a big problem. Sort of like if you ask it to write an app like Facebook: it would give you a bunch of crap, which is worse.
  • gkfasdfasdf an hour ago |
    This question was posed to o1, and it is able to reason through it - but now I wonder if that is because the model is already aware of the puzzle.

    https://x.com/d_feldman/status/1834313124058726894

    • cdfuller an hour ago |
      I think that could be likely. I just asked 4o "When is Cheryl's birthday?" without any other context and was given this reply:

      Cheryl's birthday puzzle is a logic problem where Albert and Bernard are trying to figure out Cheryl's birthday based on certain clues.

      Cheryl provides them with ten possible dates: May 15, May 16, May 19, June 17, June 18, July 14, July 16, August 14, August 15, and August 17.

      Here’s the reasoning:

      1. Albert knows the month and Bernard knows the day.

      2. Albert says he knows Cheryl’s birthday, meaning May and June can be eliminated because they contain unique days (May 19 and June 18). If Albert had been told May or June, he wouldn’t know for sure.

      3. Bernard, knowing this, says he now knows Cheryl’s birthday. This eliminates the remaining dates with unique days (July 14 and August 14).

      4. Albert then confirms that he also knows the birthday, meaning Cheryl’s birthday must be in July or August, but on a date with no unique days left: July 16, August 15, or August 17.

      Thus, Cheryl's birthday is *July 16*.

  • ynniv an hour ago |
    The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a python interpreter to check their work.
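
    That loop is easy to wire up by hand. Here's a rough sketch of the idea, assuming a hypothetical ask_llm(prompt) helper for whichever model or API you use (it is not a real client library, just a placeholder):

        import subprocess
        import sys
        import tempfile

        def ask_llm(prompt: str) -> str:
            # Hypothetical: call whatever model you like and return its code as text.
            raise NotImplementedError

        def run_python(code: str, timeout: int = 30):
            # Execute the generated code in a subprocess, capturing stdout/stderr.
            # (Temp files are left on disk; fine for a sketch.)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, text=True, timeout=timeout)
            return proc.returncode, proc.stdout, proc.stderr

        def solve_with_feedback(problem: str, max_attempts: int = 3) -> str:
            prompt = problem
            out = err = ""
            for _ in range(max_attempts):
                code = ask_llm(prompt)
                returncode, out, err = run_python(code)
                if returncode == 0:
                    return out  # let the model's own program have the last word
                # Paste the traceback back in, just as one would do by hand.
                prompt = (f"{problem}\n\nYour previous program failed with:\n"
                          f"{err}\nPlease fix it and return the full program.")
            return f"Gave up after {max_attempts} attempts; last error:\n{err}"

    The plumbing isn't the point; the point is that the exception message carries exactly the information the model needed, the same way pasting the traceback back in did above.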
  • aithrowawaycomm an hour ago |
    AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this. Theory of mind is about attributing mental states to others, not information. What Norvig has done here is present a logic puzzle, one that works equally well when the agents are Prolog programs instead of clever children. There's no "mind" in this puzzle at all. Norvig is being childishly ignorant to call this "theory of mind." It's hard to overstate my contempt for this kind of useless junk science, especially when it comes from an impressive pedigree.

    Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.

  • pfisherman 8 minutes ago |
    LLMs and NLP are to verbal reasoning what the calculator is to quantitative reasoning.

    Language, and by extension verbal reasoning, is full of ambiguity and semantic slipperiness. For example, what degree of semantic similarity distinguishes synonymous from synonym-ish concepts? When do we partition concepts into homonyms?

    I think part of the problem with how people evaluate LLMs is the expectations that people bring. Natural language != ontology. The expectation should be more Chomsky and less Boole. Asking it to solve math problems written in paragraph form is a waste of time. Use a calculator for that! Solving riddles? Code it up in Prolog!

    Instead, you should be thinking about what operations you can do on concepts, meaning, and abstract ideas! That is what these things do.