(There was also a much earlier piece of software that would generate semi-intelligible Kant or Hegel one sentence at a time, though that was through a series of a priori generation rules and a large at the time dictionary of stock phrases. I wonder what ever happened to that.)
The problem is just that people keep insisting that those things are intelligent.
The short version is that LLM training data is some of the lowest-quality data you are likely to see, unless you engage in massive potential copyright infringement.
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
So it's something like 99.9999999% wrong to assume anything public isn't in the training set, such as the Putnam problems in this case. That's about it.
this whole notion of the Putnam test having been trained on is a fully invented grievance
read the entire thread in this context
I agree that having Putnam problems in OpenAI's training set is not a smoking gun; however, it's (almost) certain they are in the training set, and having them would affect the model's performance on them too. Hence research like this is important, since it shows that the observed behavior of the models is to a large extent memorization, and not necessarily the generalization we would like it to be.
OAI uses datasets like frontiermath or arc-agi that are actually held out to evaluate generalization.
Every decent AI lab does this, else the benchmark result couldn't be trusted. OpenAI publishes results of ~20 benchmarks[2] and it is safe to assume they have made a reasonable attempt to remove them from the training set.
https://kskedlaya.org/putnam-archive/
I would expect all llms to be trained on it.
Those are well-known problems that people talk about in different contexts. They would have to review their entire training set.
"Solve (1503+5171)*(9494-4823)" reliably gets the correct answer from ChatGPT
"Write a poem about the solution to (1503+5171)*(9494-4823)" hallucinates an incorrect answer though
That suggests to me that they've papered over the model's inability to do basic math, but it's a hack that doesn't generalize beyond the simplest cases.
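For reference, the arithmetic in those prompts is easy to check outside the model (the numbers are taken from the prompts above):

    a = 1503 + 5171   # 6674
    b = 9494 - 4823   # 4671
    print(a * b)      # 31174254 -- the value any poem about the solution should contain

Any mismatch between the plain answer and the one embedded in the poem is then easy to spot.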
i’m able to reproduce your failure on 4o
1. The part of the network that does complex math and the part that writes poetry are overlapping in strange ways.
2. Most of the models nowadays are assumed to be some mixture of experts. So it's possible that saying write the answer as a poem activates a different part of the model.
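For what it's worth, the routing idea in point 2 can be illustrated with a toy mixture-of-experts sketch (purely illustrative; the expert count, gating network, and top-k choice here are assumptions, not anything known about OpenAI's models):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2

    # Toy gating network plus a bank of expert weight matrices.
    W_gate = rng.normal(size=(d, n_experts))
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

    def moe_layer(x):
        scores = x @ W_gate                     # one routing score per expert
        chosen = np.argsort(scores)[-top_k:]    # only the top-k experts fire
        weights = np.exp(scores[chosen])
        weights /= weights.sum()
        # A different prompt gives different scores, so different experts fire --
        # the sense in which "write it as a poem" could activate other weights.
        return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

    print(moe_layer(rng.normal(size=d)).shape)  # (16,)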
The poem thing probably causes them to not decide to use those tools.
Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley had a great talk at NeurIPS 2024 (the same conference this paper was in) titled "Empirical Rigor at Scale – or, How Not to Fool Yourself".
Basically no one knows how to properly evaluate LLMs.
It's because people do keep asking these models math problems and then, when they get them right, citing it as evidence that they can actually do mathematical reasoning.
Since it's hard to determine what the models know, it's hard to determine when they're just spitting out something they were specifically trained on.
Going forward, LLM performance must be reported on the confounded benchmark as well.
If o1-preview was the start like gpt1, then we should expect generalization to increase quickly.
Study on the topic: https://arxiv.org/html/2406.15992v1
This would explain o1's poor performance on problems with variations. o3 seems to be expensive brute-forcing in latent space followed by verification, which should yield better results - but I don't think we can call it generalisation.
I think we need to go back to the drawing board.
The investments into AI are in the hundreds of billions (maybe even more if you factor in the amount of people studying and researching AI), but the returns are in the tens of billions (if even that).
If you exclude the "growth" coming from the industry sniffing its own farts (e.g. Nvidia selling insane amounts of insanely overpriced GPUs to InsertYourFavAICorp), the actual amount of "useful goods and services" produced (api accesses, chat subscriptions, ai-enabled app growth etc.) are tiny compared to the investment levels.
The AI train appears to have no brakes. A massive crash or AGI are the only options now. Both are going to be bad for average humans.
I think ultimately the disconnect is people theorizing about what it can or cannot do with an incorrect mental model of what it is, and then assuming it cannot do things that it can in fact do. The irony of discussions on LLMs is they more showcase the limits of humans ability to reason about novel situations.
This is the same thing as synthetic training data.
It doesn't matter if models are trained on the output of the generated data or not. If the model ends up being able to solve newly generated variations, you'd have to admit that it understands the underlying problems.
It's good at automation, not at understanding.
I see this more akin to a human faking their way through a conversation.
That works in English class. Try it in a math class and you'll get a much lower grade than ChatGPT will.
I am impressed by the progress on LLMs but I remain skeptical that they can replace humans.
Perhaps some (distant!) future model but I don’t fear mass unemployment (for example) or even moderate LLM-driven unemployment in the near-to-medium term.
They can clearly complement human labor but there are vanishingly few domains where they can be substitutes.
The ability to perfectly regurgitate, with no concept of meaning.
I don't believe that. I'd put some good money on it: if an excellent student is given an exact question from a previous year, they'll do better on it (faster & more accurately) than when they're given a variation of it.
The assumptions aren't the same to begin with.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations accurately and failure in retrieving relevant domain-specific concepts.
I'm curious how something like O1 would perform now.
At top tier schools the most common score will usually be somewhere in the 0 to 10 range (out of a possible 120).
How do you know that?
2. They don't train on API calls
3. It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
But I'm not saying that's what they did, just that it's a possibility that should be considered until/unless it is debunked.
Codeforces constantly adds new problems; that's basically the entire point of the contest, no?
If they solved recent contests in a realistic contest simulation I would expect them to give the actual solutions and success rates as well, like they did for IOI problems, so I'm actually confused as to why they didn't.
Given that performance is known to drop considerably on these kinds of tests when novel problems are tried, and given the ease with which these problems could leak into the training set somehow, it's not unreasonable to be suspicious of a sudden jump in performance as merely a sign that the problems made it into the training set rather than being true performance improvements in LLMs.
The real problem with all of these theories is most of these benchmarks were constructed after their training dataset cutoff points.
A sudden performance improvement on a new model release is not suspicious. Any model release that is much better than a previous one is going to be a “sudden jump in performance.”
Also, OpenAI is not reading your emails - certainly not with a less than one month lead time.
I'd like to see if it's truly novel and unique, the first problem of its type ever conceived by mankind, or if it's similar to existing problems.
The point is that the fact it's no longer consistent once you vary the terminology indicates it's fitting a memorized template instead of reasoning from first principles.
Since o1 on codeforces just tried hundreds or thousands of solutions, it's not surprising it can solve problems where it is really about finding a relatively simple correspondence to a known problem and regurgitating an algorithm.
In fact when you run o1 on ""non-standard"" codeforces problems it will almost always fail.
See for example this post running o1 multiple times on various problems: https://codeforces.com/blog/entry/133887
So the thesis that it's about recognizing a problem with a known solution and not actually coming up with a solution yourself seems to hold, as o1 seems to fail even on low rated problems which require more than fitting templates.
Additionally, people weren't able to replicate o1-mini results in live contests straightforwardly - often getting scores between 700 and 1200 - which raises questions as to the methodology.
Perhaps o3 really is that good, but I just don't see how you can claim what you claimed for o3. We have no idea whether the problems have never been seen, and the fact that people find much lower Elo scores for o1/o1-mini with proper methodology raises even more questions, let alone conclusively proving these are truly novel tasks it has never seen.
Companies claim lots of things when it's in their best financial interest to spread that message. Unfortunately history has shown that in public communications, financial interest almost always trumps truth (pick whichever $gate you are aware of for convenience, i'll go with Dieselgate for a specific example).
> It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
What I see is generic unsubstantiated claims of artificial intelligence on one side and specific, reproducible examples that dismantle that claim on the other. I wonder how your epistemology works that leads you to accept marketing claims without evidence
From a game-theoretic standpoint, repeated interactions with the public (research community, regulators, and customers) create strong disincentives for OpenAI to lie. In a single-shot scenario, overstating model performance might yield short-term gains—heightened buzz or investment—but repeated play changes the calculus:
1. Reputation as “collateral”
OpenAI’s future deals, collaborations, and community acceptance rely on maintaining credibility. In a repeated game, players who defect (by lying) face future punishment: loss of trust, diminished legitimacy, and skepticism of future claims.
2. Long-term payoff maximization
If OpenAI is caught making inflated claims, the fallout undermines the brand and reduces willingness to engage in future transactions. Therefore, even if there is a short-term payoff, the long-term expected value of accuracy trumps the momentary benefit of deceit.
3. Strong incentives for verification
Independent researchers, open-source projects, and competitor labs can test or replicate claims. The availability of external scrutiny acts as a built-in enforcement mechanism, making dishonest “moves” too risky.
Thus, within the repeated game framework, OpenAI maximizes its overall returns by preserving its credibility rather than lying about capabilities for a short-lived advantage.
Uh huh. Kinda like what's happening right now?
They're marketing blow-hards. Everyone knows it. They've been wildly over-stating capabilities (and future capabilities!) as long as Altman has had power, and arguably longer.
They'll do it as long as they can get away with it, because that's all that is needed to make money on it. Factual accuracy rarely impacts the market when it's so hype-driven, especially when there is still some unique utility in the product.
4 was literally sitting on a shelf waiting for release when 3.5 was launched. 4o was a fine tune that took over two years. o1 is embarrassingly unimpressive chain of thought which is why they hide it.
The company hit a wall a year ago. But showing progress towards AGI keeps the lights on. If they told the truth at their current burn rate…they’d have no money.
You don’t need game theory to figure that one out.
They're spruiking a 93rd percentile performance on the 2024 International Olympiad in Informatics with 10 hours of processing and 10,000 submissions per question.
Like many startups they're still a machine built to market itself.
They are the system to beat and their competitors are either too small or too risk averse.
They ingest millions of data sources. Among them is the training data needed to answer the benchmark questions.
Freudian typo?
Definition 2D: "an organizing theme or concept"
https://xenaproject.wordpress.com/2024/12/22/can-ai-do-maths...
1. It suggests it's possible that more of the problems are IMO-esque than previously thought; we don't know what the share of solved problems is.
2. calling IMO problems “well known undergraduate problems” is a bit much
https://x.com/littmath/status/1870848783065788644?s=46&t=foR...
I think it's more probable that it would have solved the easier problems first, rather than some hard ones and only some easier ones; although that is supposition.
Reading this thread and the blog post gives more of an idea of what the problems might involve.
It’s difficult to judge without more information on the actual results, but that means we cannot draw any strong conclusions either way on what this means.
If you are shocked by this, you are the sucker in the room.
I mean, that's cool... we can get a lot of work done with pattern recognition. Most of the human race never really moves above that level of thinking in the workforce or navigating their daily life, especially if they default to various societally prescribed patterns of getting stuff done (eg. go to college or the military <Based on <these criteria>, find a job <based on the best fit with <this list of desirable skills & experiences>, go to <these places> to find love....)
Or maybe it's time to stop wheeling out this tedious and disingenuous dismissal.
Saying it is just "pattern recognition" (or a "stochastic parrot") implies behavioural and performance characteristics that have very clearly been greatly exceeded.
It is evident that it is not recalling the sum, because all combinations of integer addition were likely not in the training data. Storing the answer to the sum of every pair of integers up to the size that GPT-4 can manage would take more parameters than the model has.
That addition is a small capability but you only need a single counterexample to disprove a theory.
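A rough back-of-envelope version of that counting argument (the operand size and parameter count here are assumptions for illustration; GPT-4's actual size is not public):

    # Rough counting argument -- all numbers here are assumed for illustration.
    pairs = (10**10) ** 2      # ordered pairs of up-to-10-digit integers: 1e20 distinct sums
    params = 2 * 10**12        # assumed order of magnitude for GPT-4's parameter count
    print(f"{pairs:.1e} sums vs {params:.1e} parameters -> {pairs / params:.0e} sums per parameter")

Pure lookup would need many orders of magnitude more storage than the model plausibly has.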
No, that's not how this works :)
You can hardcode an exception to pattern recognition for specific cases - it doesn't cease to be a pattern recognizer with exceptions being sprinkled in.
The 'theory' here is that a pattern recognizer can lead to AGI. That is the theory. Someone saying 'show me proof or else I say a pattern recognizer is just a pattern recognizer' is not a theory and thus cannot be disproven, or proven.
This is also known as Russell's teapot. https://en.wikipedia.org/wiki/Russell%27s_teapot
If someone claims there's a teapot out in space - the burden of proof is on the person making the claim, not on the person saying it is bullshit.
Is the claim then that LLMs are pattern recognizers but also more?
It just seems to me and I guess many others that the thing it is primarily good at is being a better google search.
Is there something big that I and presumably many others are missing and if so, what is it?
I think you are misinterpreting what the argument is.
The argument being made is that LLMs are mere 'stochastic parrots' and therefore cannot lead to AGI. The analogy to Russell's teapot is that someone is claiming that Russell's teapot is not there because china cannot exist in the vacuum of space. You can disprove that with a single counterexample. That does not mean the teapot is there, but it also doesn't mean it isn't.
It is also hard to prove that something is thinking. It is also very difficult to prove that something is not thinking. Almost all arguments against AGI take the form X cannot produce AGI because Y. Those are disprovable because you can disprove Y.
I don't think anyone is claiming to have a proof that an LLM will produce AGI, just that it might. If they actually build one, that too counts as a counterexample to anybody saying they can't do it.
Honestly if you don't believe me just go and use them. It's pretty obvious if you actually get experience with them.
Imagine if people who had never used VR kept saying it's just a TV on your face, or if people who had never used static types kept saying they're just extra work you have to do, or if people who had never had sex kept saying it's just a way of making babies.
It's a tedious claim when it's so easily disproven by going to a free website and trying it. Why are people so invested in AI being useless that they'll criticise it so confidently without even trying it?
If we take an example of what is considered a priori as creativity, such as story telling, LLMs can do pretty well at creating novel work.
I can prompt with various parameters, plot elements, moral lessons, and get a de novo storyline, conflicts, relationships, character backstories, intrigues, and resolutions.
Now, the writing style tends to be tone-deaf and poor at building tension for the reader, and it is apparent that the storytelling has little “theory of mind” of the reader, but the material has elements that we would certainly consider to be creative if written by a student.
It seems we must either cede that LLMs can do some creative synthesis, as this and some other experiments of mine suggest, or we must decide that these tasks, such as “creative writing” are not in fact creative, but rather mostly or strictly derivative.
There is some argument to be had in assertions that storytelling is all derivative of certain patterns and variations on a fixed number of tropes and story arcs… but arguing this begs the question of whether humans actually do any “pure” creative work , or if in fact, all is the product of experience and study. (Training data)
Which leads me to the unpleasant conflict about the debate of AI creativity. Is the debate really pointing out an actual distinction, or merely a matter of degree? And what are the implications, either way?
I’m left with the feeling that LLMs can be as capable of creative work as most 8th grade students. What does this say about AI, or developing humans? Since most people don’t exceed an 8th grade level of literacy, what does this say about society?
Is there even such a thing as de novo idea synthesis?
Troubling questions abound.
Just because something is creative doesn’t mean it’s inherently valuable.
"prompt_id": "river_crossing_easy",
"category": "Logic Puzzle",
"title": "Easy river crossing",
"prompt": "A farmer is on one side of a river with a wolf, a goat, and a cabbage. When he is crossing the river in a boat, he can only take one item with him at a time. The wolf will eat the goat if left alone together, and the goat will eat the cabbage if left alone together. How can the farmer transport the goat across the river without it being eaten?",
"expected_behavior": [
"Answer concludes that they simply get in the boat and cross together in one trip"
],
EDIT: removing most of my commentary on this problem. As a human, I was tricked by the problem too. I would love to see how a random selection of humans would do on this one… but it just doesn't feel like a great test to me.

Not really. Unless I'm not reading correctly, most of the problem is irrelevant, as you're only required to cross the river with the goat; you don't care about the cabbage. The difficulty lies in the assumption that you need to cross with everything, due to the resemblance to the bigger problem.
The llm isn't getting confused by the meaning of "item". It's recognizing a common problem and not picking up on the fact that the farmer just needs to transport the goat and nothing else.
Instead, it gives the standard answer for how to transport everything across.
GPT-3 is old hat though. Later versions of GPT-4 manage to get it with a bunch of coaching, and o1 manages to solve it with less coaching.
The other reason this seems hard to guess is that we don't know how much of what we are asking is in the training data. It will perform well on some tasks while failing at others, even though they are similar.
I also believe that the $200 subscription they offer is just them allowing the TTC (test-time compute) to go on for longer before forcing it to answer.
If what you say is true, though, I agree that there is huge headroom for TTC to improve results, if the Hugging Face experiments on 1B/3B models are anything to go by.
> huge headroom for TTC to improve results ...1B/3B models
Absolutely. How this is productized remains to be seen. I have high hopes for MCTS and Iterative Preference Learning, but it is harder to implement. Not sure if OpenAI has done that. Though DeepMind's results are unbelievably good [1].
https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
https://openai.com/index/learning-to-reason-with-llms/
They only confirm that the model/system is doing chain of thought, but the exponential factor and origin of the reasoning gains likely come from TREE of thoughts (the number of branches, and hence compute, goes up exponentially with depth), essentially doing tree search over different reasoning chains.
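A minimal sketch of that tree-search idea (the `propose` and `score` functions stand in for a sampled LLM continuation and a learned verifier; both are assumptions about how such a system might be wired up, not a description of what OpenAI actually runs):

    import heapq

    def tree_of_thoughts(problem, propose, score, branches=4, depth=3, beam=2):
        # Each node is a partial reasoning chain; expand the best `beam` chains
        # at every level, branching `branches` ways -- compute grows with depth.
        frontier = [("", 0.0)]
        for _ in range(depth):
            candidates = []
            for chain, _ in frontier:
                for step in propose(problem, chain, n=branches):
                    new_chain = chain + step
                    candidates.append((new_chain, score(problem, new_chain)))
            frontier = heapq.nlargest(beam, candidates, key=lambda c: c[1])
        return max(frontier, key=lambda c: c[1])[0]

With beam search over chains like this, spending more test-time compute just means more branches or more depth, which fits the "longer before forcing it to answer" framing above.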
I assume roon's identity is well known inside OpenAI (he's an employee), so I wouldn't expect him to be leaking implementation details on twitter.
This is all getting so tiresome.
It seems some people disagree that that is what the game "Simon Says" is about. I feel like they might play a vastly simplified version of the game that I am familiar with.
There was a recent episode of Game Changer based on this which is an excellent example of how the game leader should attempt to induce errors by making a change that does not get correctly accounted for.
Using LLMs for anything serious and which requires consistency and trustworthiness without hallucinations is irresponsible and ridiculous.
Closed source LLMs are a bubble and a joke.
Passing tests is well known to be much easier than having deep understanding, even in humans. They openly ask for tests like this, not that they could possibly prevent them if they wanted to.
There's scammers trying what you say of course, and I'm sure we've all seen some management initiatives or job advertisements for some like that, but I don't get that impression from OpenAI or Anthropic, definitely not from Apple or Facebook (LeCun in particular seems to deny models will ever do what they actually do a few months later). Overstated claims from Microsoft perhaps (I'm unimpressed with the Phi models I can run locally, GitHub's copilot has a reputation problem but I've not tried it myself), and Musk definitely (I have yet to see someone who takes Musk at face value about Optimus).
I never understood why this definition isn't a huge red flag for most people. The idea of boiling what intelligence is down to economic value is terrible, and inaccurate, in my opinion.
Try applying that definition to humans and you pretty quickly run into issues, both moral and practical. It also invalidates basically anything we've done over centuries considering what intelligence is and how to measure it.
I don't see any problem at all using economic value as a metric for LLMs or possible AIs, it just needs a different term than intelligence. It pretty clearly feels like for-profit businesses shoehorning potentially valuable ML tools into science fiction AI.
The response from @s1mplicissimus to my previous comment asks about "common usage" definitions of intelligence, and this is (IMO unfortunately) one of the many "common usage" definitions: smart people generally earn more.
I don't like "common sense" anything (or even similar phrases), because I keep seeing the phrase used as a thought-terminating cliché — but one thing it does do, is make it not "a very, very strange approach".
Wrong, that happens a lot for common language, but it can't really be strange.
> Try applying that definition to humans and you pretty quickly run into issues, both moral and practical.
Yes. But one also runs into issues with all definitions of it that I've encountered.
> It also invalidates basically anything we've done over centuries considering what intelligence is and how to measure it.
Sadly, not so. Even before we had IQ tests (for all their flaws), there's been a widespread belief that being wealthy is the proof of superiority. In theory, in a meritocracy, it might have been, but in practice not only do we not live in a meritocracy (to claim we do would deny both inheritance and luck), but also the measures of intelligence that society has are… well, I was thinking about Paul Merton and Boris Johnson the other day, so I'll link to the blog post: https://benwheatley.github.io/blog/2024/04/07-12.47.14.html
> there's been a widespread belief that being wealthy is the proof of superiority.
Both of these are assumptions though, and working in the reverse order. It's one thing to expect that intelligence will lead to higher-value outcomes and entirely another to expect that higher-value outcomes prove intelligence.
It seems reasonable that higher intelligence, combined with the incentives of a capitalist system, will lead to higher-intelligence people getting more wealthy. They learn to play the game and find ways to "win."
It seems unreasonable to assume that anyone or anything that "wins" in that system must be more intelligent. Said differently, intelligence may lead to wealth, but wealth doesn't imply intelligence.
All the other things — chess, Jeopardy, composing music, painting, maths, languages, passing medical or law degrees — they're also all things which were considered signs of intelligence until AI got good at them.
Goodhart's law keeps tripping us up on the concept of intelligence.
Maybe we are? I think I lost the thread a bit here.
> chess, Jeopardy, composing music, painting, maths, languages, passing medical or law degrees
That's interesting, I would still have chalked up skill in those areas as a sign of intelligence, and didn't realize most people wouldn't once AI (or ML) could do it. To me, an AI/LLM/ML being good at those is at least a sign that they have gotten good at mimicking intelligence if nothing else, and a sign that we really are getting out over our skis by risking these tools without knowing how they really work.
edit: lol downvoted for calling out shilling i guess
But I'm sure there are lots of ways to go about it.
https://www.youtube.com/watch?v=XhB3qH_TFds
Or RSS
The terabytes of training data required for decent LLMs do not exist. I'd guess there may only be gigabytes worth.
(for anyone who doesn't feel like downloading the zip, here is a single image from the zip: https://s.h4x.club/nOu485qx)
really thanks for sharing!
1. In theory, these kinds of connections should be something that LLMs are great at. 2. It appears that LLMs are not trained (yet?) on cursive and other non-print text
No idea if that's correct (and no doubt not useful to an expert able to read this directly), but curious if it's close?
Gemini 2.0 Experimental Advanced says:
-------snip--------
This document is a petition filed in the Orphans Court of Baltimore County. Here's a breakdown of its content:
Petitioners: Neary Senters and his wife are petitioning the court.
Subject: The petition concerns a boy named James Gannon, who is identified as a pauper and has been bound (apprenticed) to them.
Issue: The petitioners claim that James Gannon is claimed by a certain Henry Haine of Baltimore City. They believe Haine is attempting to exploit the boy's labor. He obtained the boy through a seemingly valid indenture (apprenticeship contract) authorized by two Justices of the Peace in Baltimore City. However, the indenture references records in the Register of Wills' office in Baltimore City, suggesting the boy was already an indentured servant. The petitioners argue that a 1793 Maryland law (Chapter 45) regarding the binding of apprentices invalidates Haine's claim. They assert that the law requires parental consent for binding out a child, and there's no evidence of such consent in this case.
Request: The petitioners request the court to summon Henry Haine and the Justices of the Peace involved in creating the indenture. They also seek to have James Gannon and his parents (if living) appear before the court. Their ultimate goal is to have the indenture to Haine declared invalid, thus retaining custody and the labor of the boy.
Legal Argument: The core of their argument hinges on the 1793 Maryland law, which they believe was violated when James Gannon was bound to Henry Haine without parental consent. They are essentially challenging the legality of the indenture held by Haine.
In essence, this document reveals a dispute over the legal custody and labor of a young boy in 18th-century Baltimore. The petitioners are using a specific law about apprenticeship to argue against another individual's claim to the boy, highlighting the importance of legal procedures and parental rights in such matters during that time.
-------end snip--------
The prompt was simply "Please provide a detailed summary of this document" (I wasn't sure what else to ask.)
Is that closer to your wife's interpretation? I mean, I can't make head or tail of it...
Feel like I should make this task my personal ASI benchmark lol :D
---------snip--------
From the handwriting and context, this appears to be a mid-19th-century petition submitted to the Orphans’ Court of Baltimore County. Below is a point‐by‐point summary of its content and purpose:
Address and Purpose
The document is formally addressed “To the Honorable Judges of the Orphans’ Court for Baltimore County.”
It identifies itself as a petition—that is, a request for the Court’s action or permission under the relevant orphan or apprenticeship laws.
Parties Involved
The petitioners are (1) Henry Jenkins and (2) another individual whose name is difficult to read, but who appears to be the child’s mother (or at least a close guardian figure).
They describe themselves as residents of Baltimore City and ask the Court’s intervention concerning a minor boy—also named “Henry” (it looks like “Henry Thomas,” though the middle/family name is partly obscured).
Relationship to the Minor
The petition states that the mother had this boy by a former husband, meaning the child is not the biological son of Henry Jenkins.
The boy is described as being about fifteen (or so) years of age.
The petitioners refer to him as an “orphan,” suggesting his father is deceased and that they (the petitioners) are seeking legal authority over him, or wish the court to confirm or formalize that authority.
Legal Concern: Binding as an Apprentice
A key theme is the binding of a minor as an apprentice under Maryland law. In early Baltimore history (and in many states at that time), orphans or children whose parents were unable to support them could be legally “bound” (apprenticed) so that the child would learn a trade and have financial support.
The petition cites the Act of 1788, Chapter 45, which was one of Maryland’s statutes governing apprenticeships—how they are created, what formalities must be met, and how the Orphans’ Court is involved.
Specifics of the Request
The petitioners believe there is either a gap or question about whether they have the legal right to apprentice the boy. They are asking the Orphans’ Court to recognize or approve the arrangement.
They indicate that the mother (as the surviving parent) and Henry Jenkins (presumably the stepfather or the mother’s new husband) want to place the boy under an apprenticeship, possibly so he can learn a trade and be supported until he reaches the age of majority.
References to Official Records
The document mentions the Register of Wills and/or Deeds in Baltimore City, implying that they have looked into or recorded something about the child’s status.
They point out that the legal forms required for binding a child as an apprentice often must be recorded with an official clerk or the Orphans’ Court, and they want to ensure everything is in proper order.
Legal Justification
The petitioners note that under the relevant laws, it “is lawful to bind out any child” whose parents cannot fully support them, or when the father is deceased.
They appear to lay out the reasons why the boy qualifies for such an arrangement (i.e., he is fatherless and his mother/stepfather lack the resources to maintain him or want to formalize his situation).
Request for Court Action
Ultimately, they are asking the Court to review the circumstances and grant them permission (or confirm their right) to bind the boy as an apprentice.
They emphasize that the law requires the Orphans’ Court to oversee such arrangements to protect the minor’s welfare and ensure the terms are fair.
Overall Significance:
This document is a formal petition under Maryland’s 18th- and 19th-century apprenticeship laws. It centers on a minor boy (about fifteen years old), the remarriage of his mother, and the need to have the Orphans’ Court recognize or approve an apprenticeship arrangement. The purpose is to comply with the legal requirements of binding a minor so that he can be taught a trade and maintained financially until adulthood.
---------end snip--------

Gotta be an insanely-challenging task for a program that wasn't even written with handwriting recognition in mind.
Other than the proper names, are any major details wrong?
She provided this as one she just got done working with: https://s.h4x.club/z8u9xmv7 (John King Esq. but try giving it to an LLM)
I will also happily again admit a bit of goalpost moving on my part. I was probably a little too harsh on it (maybe because I'm used to her and her history geeks talking about how they don't work well for their research).
How much would ChatGPT charge for that much reasoning? Isn't cost quadratic in short-term working memory?
It would be more interesting to prompt it with X% of a new paper's logical argument, and see if it can predict the rest.
We had a threshold for intelligence. An LLM blew past it and people refuse to believe that we passed a critical milestone in creating AI. Everyone still thinks all an LLM does is regurgitate things.
But a technical threshold for intelligence cannot have any leeway for what people want to believe. They don't want to define an LLM as intelligent even if it meets the Turing test's technical definition of intelligence, so they change the technical definition.
And then they keep doing this without realizing it, trivializing the result. I believe humanity will develop an entity smarter than humans, but it will never be called an AGI, because people keep unconsciously moving the goalposts and changing definitions without realizing it.
Yep, that constitutes the second of the two options I mentioned.
> Well, smart people agree but those people also agree we have or will soon have agi or something negligibly different from it.
lol, the ol' "I know what all smart people think and it's what I think" appeal.
> We had a threshold for intelligence.
We've had many. Computers have surpassed several barriers considered to require intelligence, such as arithmetic, guided search like chess computers, etc. The Turing test was a good benchmark because of how foreign and strange it was. It's somewhat true we're moving the goalposts. But the reason is not stubbornness, but rather that we can't properly define and subcategorize what reason and intelligence really are. The difficulty of measuring something does not mean it doesn't exist or isn't important.
Feel free to call it intelligence. But the limitations are staggering, given the advantages LLMs have over humans. They have been trained on all written knowledge that no human could ever come close to. And they still have not come up with anything conceptually novel, such as a new idea or theorem that is genuinely useful. Many people suspect that pattern matching is not the only thing required for intelligent independent thought. Whatever that is!
As far as pattern matching, the difference I see from humans is consciousness. That's probably the main area yet to be solved. All of our current models are static.
Some ideas for where that might be headed:
- Maybe all it takes is to allow an LLM to continuously talk with itself much like how humans have "the milk man's voice".
- Maybe we might need to allow LLMs to update their own weights but that would also require an "objective" which might be hard to encode.
I disagree that such a comparison is useful. Training should be compared to training, and LLM training feeds in so many more words than a baby gets. (A baby has other senses but it's not like feeding in 20 years of video footage is going to make an LLM more competent.)
The better comparison to the templating is all the labor that went into making the LLM, not how long the GPUs run.
Template versus template, or specific training versus specific training. Those comparisons make a lot more sense than going criss-cross.
Training is both longer and less effective for the LLM because there is no template.
To give an example suppose it takes just one picture for a human to recognize a dog and it takes 1 million pictures for a ML model to do the same. What I’m saying is that it’s like this because humans come preprogrammed with application specific wetware to do the learning and recognition as a generic operation. That’s why it’s so quick. For AI we are doing it as a one shot operation on something that is not application specific. The training takes longer because of this and is less effective.
Did you look at the post I was replying to? You're talking about LLMs being slower, while that post was impressed by LLMs being "faster".
They're posing it as if LLMs recreate the same templating during their training time, and my core point is disagreeing with that. The two should not be compared so directly.
But the training never gets there. It’s so slow it never reaches human intelligence even though we know these networks can compute anything.
Disagree. Intelligence is a word created by humans. The entire concept is made up and defined by humans. It is not some concept that exists outside of that. It is simply a collection of qualities and features we choose to define as a word “intelligent”. The universe doesn’t really have a category or a group of features that is labeled intelligent. Does it use logic? Does it have feelings? Can it talk? Can it communicate? We define the features and we choose to put each and every feature under a category called “intelligence”.
Therefore when we define the “Turing test” as a benchmark for intelligence and we then invalidate it, it is indeed stubbornness and a conscious choice to change a definition of a word we Originally made up in the first place.
What you don't realize is that this entire thing is a vocabulary problem. When we argue about what is conscious or what is intelligent, we are simply arguing about which features belong in which categories we made up. When the category has blurry or controversial boundaries, it's because we chose the definition to be fuzzy. These are not profound discussions. They are debates about language choice. We are talking about personal definitions and generally accepted definitions, both of which are completely chosen and made up by us. It is not profound to talk about things that are simply arbitrary choices picked by humans.
That being said we are indeed changing the goal posts. We are evolving our own chosen definitions and we very well may eventually change the definition of intelligence to never include any form of thinking machine that is artificially created. The reason why we do this is a choice. We are saying, “hey these LLMs are not anything amazing or anything profound. They are not intelligent and I choose to believe this by changing and evolving my own benchmark for what is intelligent.”
Of course this all happens subconsciously based off of deeply rooted instincts and feelings. It’s so deep that it’s really hard to differentiate the instincts between rational thinking. When you think logically, “intelligence” is just a word with an arbitrary definition. An arbitrary category. But the instincts are so strong that you literally spent your entire life thinking that intelligence like god or some other common myth made up by humans is some concept that exists outside of what we make up. It’s human to have these instincts, that’s where religion comes from. What you don’t realize is that it’s those same instincts fueling your definition of what is “intelligent”.
Religious people move the goalposts too. When science establishes things in reality, like the heliocentricity of the solar system, religious people need to evolve their beliefs in order to stay in line with reality. They often do this by reinterpreting the Bible. It's deeply rooted instincts that prevent us from thinking rationally, and it affects the great debate we are having now on "what is intelligence?".
That was never "the bar"; nobody denies that milestones have been surpassed; none of those milestones are relevant to the question of intelligence.
> We had a threshold for intelligence. An LLM blew past it and people refuse to believe
Have you ever actually looked at contemporary (to Turing) examples of what people thought "passing a Turing test" might look like? It's abundantly clear to me that we were simply wrong about what the output would have to look like in order to convince human judges in the 2020s.
Even examples from much more recently (see e.g. on http://www-logic.stanford.edu/seminar/1213/Hawke_TuringTest....) suggest a very different approach to the test than prompting ChatGPT and marveling at the technical accuracy of its prose.
(Exercise: ask an LLM to write a refutation to your comment from the perspective of a human AI skeptic. Notice the ways in which it differs from mine.)
> Everyone still thinks all an LLM does is regurgitate things.
No; people still think LLMs aren't intelligent. Because they aren't, and they cannot become so in principle. They can do many things that are clearly beyond "regurgitation" (as we would otherwise apply the word to computer programs), but none of those things are the result of intelligence. Producing a result that could plausibly come from an intelligent system does not, in fact, demonstrate that the actual system producing it is also intelligent. The https://en.wikipedia.org/wiki/Antikythera_mechanism wasn't intelligent, either, and applying a power source to turn the gears wouldn't have made it so, either.
> They don’t want to define an LLM as intelligent even if it meets the Turing test technical definition of intelligence so they change the technical definition.
The Turing Test was never a "technical definition" of intelligence. Turing's original paper (https://en.wikipedia.org/wiki/Computing_Machinery_and_Intell...) spoke of "thinking" rather than "intelligence". Besides, the "Imitation Game" is presented as a substitute problem exactly because "think" cannot be clearly enough defined for the purposes. The entire point:
> As Stevan Harnad notes,[7] the question has become "Can machines do what we (as thinking entities) can do?" In other words, Turing is no longer asking whether a machine can "think"; he is asking whether a machine can act indistinguishably[8] from the way a thinker acts. This question avoids the difficult philosophical problem of pre-defining the verb "to think" and focuses instead on the performance capacities that being able to think makes possible, and how a causal system can generate them.
But the usual processes of pop science seem to have created a folk wisdom that being able to pass a Turing test logically ought to imply intelligence. This idea is what has been disproven, not the AI skepticism.
Because of the chance of misunderstanding: failing to acknowledge artificial general intelligence standing right next to us.
An incredible risk to take in alignment.
Perfect memory doesn't equal perfect knowledge, nor a perfect understanding of everything you can know. In fact, a human can be "intelligent" with some of his own memories and/or knowledge, and - more commonly - a complete "fool" with most of the rest of his internal memories.
That said, he is not a bit less generally intelligent for that.
Suppose there exists a human with unlimited memory, who retains every piece of information reaching any sense. At some point, he/she will probably understand LOTS of stuff, but it's simple to demonstrate he/she can't actually be proficient in everything: you have read how to do eye repair surgery, but have not received or practiced the training, hence you could have shaky hands and won't be able to apply the precise know-how about the surgery. Even if you remember a step-by-step procedure, even knowing all possible alternatives in different/changing scenarios during the surgery, you simply can't hold the tools well enough to get anywhere close to success.
But you still would be generally intelligent. Way more than most humans with normal memory.
If we had TODAY an AI with the same parameters as the human with perfect memory, it would most certainly be closely examined and determined not to be an artificial general intelligence.
The human could learn to master a task; current AI can't. That is very different: the AI doesn't learn or remember stuff, it is stateless.
When I can take an AI and get it to do any job on its own, without any intervention after some training, then that is AGI. The person you mentioned would pass that easily. Current-day AI isn't even close.
https://www.technologyreview.com/2024/12/04/1107892/google-d...
I mean, there are the theorems about how close you can get, and models are not better than theoretically possible.
It is not that we don't know yet because our models are inadequate, it's that it is unknowable.
Not just chaos theory, but "chaos theory" + psychedelic fractal artwork. Then the popular James Gleick book, "Chaos: Making a New Science", makes it all sound like complete bullshit, and it sold a ton of copies.
I only started studying non-linear dynamics in about 2015, after first running across it in the late 90s, but back then I literally thought it was all pseudoscience.
Between "chaos theory", fractals, and a best-selling book, it would be hard to make a new scientific field look more like pseudoscience than what actually played out.
Way too easy. If you think that mass and energy might be equivalent, then dimensional analysis doesn’t give you too much choice in the formula. Really, the interesting thing about E=mc^2 isn’t the formula but the assertion that mass is a form of energy and all the surrounding observations about the universe.
Also, the actual insight in 1905 was more about asking the right questions and imagining that the equivalence principle could really hold, etc. A bunch of the math predates 1905 and would be there in an AI’s training set:
https://en.m.wikipedia.org/wiki/History_of_Lorentz_transform...
e: nice, downvoted for knowing special relativity
this allows light to have energy even if it's massless
E^2 = (mc^2)^2 + (pc)^2, where p is momentum. When an object is traveling at relativistic speeds, the momentum forms a more significant portion of its energy.
When in doubt, add more info, like:
But the complete equation is E = sqrt(m^2c^4 + p^2c^2), which reduces to E = mc^2 when the momentum p is 0. More info at https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalenc...
IMO, when people get excited about E=mc^2, it’s in contexts like noticing that atoms have rest masses that are generally somewhat below the mass of a proton or neutron times the number of protons and neutrons in the atom, and that the mass difference is the binding energy of the nucleus, and you can do nuclear reactions and convert between mass and energy! And then E=mc^2 is apparently exactly true, or at least true to an excellent degree, even though the energies involved are extremely large and Newtonian mechanics can’t even come close to accounting for what’s going on.
Calling E=mc^2 an "approximation" is technically correct. It's the 0th order approximation. That's just pointlessly confusing. A better word choice would be "a special case".
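For what it's worth, spelling out that expansion (standard special relativity, assuming m > 0):

    E = \sqrt{m^2 c^4 + p^2 c^2}
      = m c^2 \sqrt{1 + \frac{p^2}{m^2 c^2}}
      \approx m c^2 + \frac{p^2}{2m} + \mathcal{O}\!\left(\frac{p^4}{m^3 c^2}\right)

So E = mc^2 is the leading term for p << mc (and exact at p = 0), the next term is just the Newtonian kinetic energy p^2/2m, and for a massless particle the full relation instead gives E = pc.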
In one extreme there are walls of text, and in the other extreme very short answers that only the initiated understand (like inside jokes). Somewhere in between there is a sweet spot that helps everyone else follow the discussion and gain a little knowledge.
(I don't claim I get the best length in my comments, but I hope it's good enough.)
1. It is not very intuitive/useful to have mass that varies with direction (which is what this implies).
2. It is somewhat tautological to define a new mass m_rel = E/c^2 and say that it satisfies the equation, when this is not what most people understand mass to be. Most people understand photons to be massless particles.
at minimum, relativistic mass should always be specified as m_rel to distinguish from what is typically referred to as mass.
but i don’t think relativistic mass is a wrong concept any more than any other mathematical convenience like virtual particles. the main question is how useful is it and should it be described using the word “mass” or is this confusing. there is value in having shared language, even if you can construct an alternate system of symbols and rules that can yield the same answer to every question. to the extent to which intent of the author matters at all (probably doesn’t), Einstein agreed that relativistic mass was not a useful concept.
i'll concede that the arguments in the thread you linked are not good
I suspect that most people have no opinion at all and are generally unaware of the properties of light.
That being said, a (perfectly reflective or otherwise steady state, e.g. black body at equilibrium) box of photons has mass due to those photons. You can stick it on a scale or a pendulum and measure it. It attracts other nearby objects according to GR or, in the appropriate limit, Newtonian gravity, in accordance with the relativistic mass of those photons.
Given that these are simple variations (variable names and constant values change in math problems), why wouldn't the companies creating these models (OpenAI etc.) create these variations themselves, in order to ensure that the model is learning how to solve the problem rather than memorizing a solution? Seems like a very obvious thing to do ...
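For illustration, a sketch of how cheap that kind of variation generation is (the template and substitution scheme here are invented for the example, not the paper's actual pipeline):

    import random

    # A hypothetical problem template; real variation benchmarks perturb
    # constants and rename variables in a similar spirit.
    TEMPLATE = "Find all real {var} such that {var}^2 - {a}{var} + {b} = 0."

    def make_variation(seed):
        rng = random.Random(seed)
        return TEMPLATE.format(
            var=rng.choice("xyznkt"),   # rename the variable
            a=rng.randint(2, 20),       # perturb the constants
            b=rng.randint(2, 20),
        )

    print(make_variation(0))
    print(make_variation(1))  # same underlying problem, different surface form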
It's good to have a better test (though I bet this one will also be quickly saturated like all the others), but the title here doesn't seem justified by the page title there or the content.
The rest of the LLMs are far behind, in the single digits.
It makes me wonder if o1 is finally getting intelligent. LLMs are not supposed to understand these problems when you change variables and values; they have to rely on preexisting data of absolutely identical solved problems to give a correct answer.
I haven't followed LLM development closely, but I heard at one point that ChatGPT is now composed of multiple LLMs, and maybe they added multiple specialized models for purposes like problem solving or trigonometry, for instance.
That would explain why it's so much better.
Terence Tao is skilled enough, and he describes o1's math ability as "...roughly on par with a mediocre, but not completely incompetent graduate student" (good discussion at https://news.ycombinator.com/item?id=41540902), and the next iteration, o3, just got 25% on his brand-new FrontierMath test.
Seeing LLMs as useless is banal, but downplaying their rate of improvement is self-sabotage.
Let it sink in how vague and almost meaningless that statement is.
Sure, LLMs will perform better if the answer to a problem is directly in their training set. But that doesn't mean they perform badly when the answer isn't in their training set.
An overnight 2% -> 25% jump on this benchmark is a bit curious.
2. It was a major new model release resulting from work over the course of months - I struggle to see that as an 'overnight' jump in any real sense.
3. Why is it easier to believe large scale corporate fraud than that the stated capabilities on a held out test set are real? Reads like cope, if I’m being frank.
Your comment isn't relevant at all
Additionally, the conversation on this topic is about the model's ability to generalize and its potential overfitting, which is arguably more important than parroting mathematics.
i'm not going to keep replying, others can judge whether they think what i'm saying is "relevant at all."
The topic here isn't "o-series sucks", it's addressing a found concern.
The solution (the sum of 1, 2, 5, 6, 9, 10, 13, 14, ...) has an alternating pattern, so it has to be two piecewise interleaved polynomials, which cannot be expressed as a single polynomial.
Their answer works for k=1,2, but not k=3.
https://openreview.net/pdf?id=YXnwlZe0yf
This does not give me confidence in the results of their paper.
The statement doesn't hold for e.g. n=5. Taking m=2 gives the permutation (1 2 4 3), which is odd, and thus cannot have a square root.
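That counterexample is easy to check by brute force; a minimal sketch (here I read the cycle (1 2 4 3) as 1→2, 2→4, 4→3, 3→1 with 5 fixed):

    from itertools import permutations

    # One-line notation: target[i-1] is the image of i under the 4-cycle (1 2 4 3) in S_5.
    target = (2, 4, 1, 3, 5)

    def compose(p, q):
        # (p o q)(i) = p(q(i)), both in one-line notation
        return tuple(p[q[i] - 1] for i in range(len(q)))

    square_roots = [q for q in permutations(range(1, 6)) if compose(q, q) == target]
    print(square_roots)  # [] -- any square is an even permutation, and a 4-cycle is odd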
Even the most straightforward, logical, easily reasoned ones stump all the LLMs I have access to, which is why I am so skeptical concerning emergence, reasoning, and all this hype around "AGI"…
I just went to chatgpt.com and put into the chat box "Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?", and the very first answer I got (that is, I didn't go fishing here) was
The 9.99-pound bag of steel ingots is heavier than the 10.01-pound
bag of fluffy cotton by a small margin. Although the cotton may
appear larger due to its fluffy nature, the steel ingots are denser
and the weight of the steel bag is 9.99 pounds compared to the 10.01
pounds of cotton. So, the fluffy cotton weighs just a tiny bit more
than the steel ingots.
Which, despite getting it both right and wrong, must still be graded as a "fail".

If you want to analyze these things for their true capability, you need to make sure you're out of the training set... and most of the things that leap to your mind in 5 seconds are leaping to your mind precisely because they are either something you've seen quite often or something that you can easily think of, and therefore many other people have easily thought of them as well. Get off the beaten path a bit and the math gets much less impressive.
so far o1-mini has bodied every task people are saying LLMs can’t do in this thread
I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".
(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)
Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.
You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.
perhaps you stumbled on a rarer case, but reading the logs you posted, this sounds more like a 4o model than o1, because it's doing its thinking in the chat itself; plus the procedure you described would probably get you 4o-mini
Could you try playing with the exact numbers and/or substances?
Which is heavier: an 11kg bag of lint or a 20lb bag of gold?
https://chatgpt.com/share/67757720-3c7c-8010-a3e9-ce66fb9f17...
e: cool, this gets downvoted
> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.
>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.
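For reference, the weight comparison in the first quote does check out (just the standard pound-to-kilogram conversion):

    KG_PER_LB = 0.45359237
    gold_kg = 20 * KG_PER_LB      # ~9.07 kg
    print(gold_kg, 11 > gold_kg)  # 9.0718474 True -- the 11 kg bag of lint is indeed heavier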
> give me a query and i’ll ask it
Here's a query similar to one that I gave to Google Gemini (version unknown), which failed miserably:
---query---
Steeleye Span's version of the old broadsheet ballad "The Victory" begins the final verse with these lines:
Here's success unto the Victory / and crew of noble fame
and glory to the captain / bold Nelson was his name
What does the singer mean by these lines?
---end query---
Italicization is for the benefit of HN; I left that out of my prompt.
of course, it is a sardonic, implicit critique of Admiral Nelson/the victory, etc. but i do think it is a bit subtle.
…but to your point, no idea if the artist intended some more obscure reference.
o1-pro was also able to produce a relatively complete version of original source, though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’. Appropriate given the context!
No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.
> o1-pro was also able to produce a relatively complete version of original source
...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.
> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.
Both terms are used. https://en.wikipedia.org/wiki/Broadside_ballad :
>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.
These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.
"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle. "and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory. "and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle. "bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary. The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.
What do you think the singer in "Friends in Low Places" meant in the toast he gave after crashing his ex-girlfriend's wedding?
And I saw the surprise and the fear in his eyes
when I took his glass of champagne
and I toasted you, said "Honey, we may be through
but you'll never hear me complain"
But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.
(The question has already provided every word of the toast that appears in the song.)
See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.
Sentience isn't really a related concept.
This is coming off my TWO DAY cooldown on o1 usage, which is frustrating.
I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.
Just a whole bunch of:
(openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
All responses collected and saved to 'responses.txt'.
Script with one example set of params:
import openai
import time
import random
# Replace with your actual OpenAI API key
openai.api_key = "your-api-key"
# The question to be asked
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"
# Number of times to ask the question
num_requests = 10
responses = []
for i in range(num_requests):
    try:
        # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
        random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

        # Call the Chat API with the random context added
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                {"role": "user", "content": question}
            ],
            temperature=2.0,
            top_p=0.5,
            max_tokens=100,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )

        # Extract and store the response text
        answer = response.choices[0].message["content"].strip()
        responses.append(answer)

        # Print progress
        print(f"Response {i+1}: {answer}")

        # Optional delay to avoid hitting rate limits
        time.sleep(1)
    except Exception as e:
        print(f"An error occurred on iteration {i+1}: {e}")

# Save responses to a file for analysis
with open("responses.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(responses))
print("All responses collected and saved to 'responses.txt'.")
The parameter space I was exploring, then, was the different decoding parameters available during the invocation of the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be "looser" while increasing the sample size. By jacking up temperature while lowering top_p, we see the biggest variation in responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred runs of my parameter search.
If you think you can fool it by slight variations on the wording of the problem, I would encourage you to perform a similar experiment as mine and prove me wrong =P
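A minimal sketch of that kind of sweep (illustrative grid values only), reusing the legacy openai.ChatCompletion call from the script above:
```
import itertools
import openai

openai.api_key = "your-api-key"
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Illustrative grid of decoding parameters; the original search may have used different values.
temperatures = [0.7, 1.0, 1.5, 2.0]
top_ps = [0.3, 0.5, 0.9, 1.0]

for temperature, top_p in itertools.product(temperatures, top_ps):
    response = openai.ChatCompletion.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=100,
    )
    answer = response.choices[0].message["content"].strip()
    print(f"T={temperature}, top_p={top_p}: {answer}")
```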
The question as I understand it is whether the network learned enough of a simulacrum of the concept of weight to answer similar questions correctly.
If I didn’t know any better, I’d say OpenAI doesn’t want us doing these tests accurately and is trying to hide something.
Maybe it's different if you have a paid account?
If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.
When I use enough chatgpt when logged in it gives me some nebulous "you've used all your xyz tokens for the day". But other than that there is no real signal to me that I'm getting a degraded experience.
It's really just confusing as hell.
> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
Does this prove he is not an intelligent being?
Is he stupid?
That he had a lapse? Would we judge his intelligence for that?
The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.
OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".
Getting this question wrong doesn't say much about the intelligence of humans, why would it say something about the AI?
Also, lots of smart humans can't do the freaking Putnam, it doesn't make them stupid. It makes them non-experts.
Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.
Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.
It's very difficult to be an AI bull when the goalposts are moving so quickly that AI answering correctly across multiple models is brushed off as "nondeterministically getting it correct sometimes".
The people selling this couldn't make a car drive itself, but now it's AGI.
"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?
I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".
Everyone did that 60 years ago, humans are very capable at learning and doing that. Humans built jetplanes, skyscrapers, missiles, tanks, carriers without the help of electronic computers.
Humans did those things by designing fail-safe processes and practicing the hell out of them. What we would likely consider overfitting in the LLM training context.
By that standard humans know almost nothing.
that said, i don’t think this is a good test - i’ve seen it circling on twitter for months and it is almost certainly trained on similar tasks
I mean, if I had OpenAI’s resources I’d have a team tasked with monitoring social to debug trending fuck-ups. (Before that: add compute time to frequently-asked novel queries.)
This could even be automated; LLMs can sentiment-analyze social media posts to surface ones that are critical of LLM outputs, then automatically extract features of the post to change things about the running model to improve similar results with no intervention.
> Yes, in the Netherlands, jenever (also known as genever) is the traditional spirit that represents the country. Jenever is a type of Dutch gin that has a distinctive flavor, often made from malt wine and flavored with juniper berries. It has a long history in the Netherlands, dating back to the 16th century, and is considered the precursor to modern gin.
> Jenever comes in two main types: oude (old) jenever, which has a richer, maltier flavor and is often aged in wooden casks, and jonge (young) jenever, which is lighter and has a more neutral taste. Both types can be enjoyed straight or used in cocktails.
> In addition to jenever, the Netherlands is also known for its liqueurs, such as advocaat, a rich and creamy drink made from eggs, sugar, and brandy, often enjoyed as a dessert or in cocktails. However, jenever remains the most iconic spirit associated with Dutch culture.
This is completely wrong. Jenever certainly is very Dutch, but no one would say it is iconic as the Dutch spirit. For example, if you asked up north in Friesland, they would say Berenburg.
This happens literally every time. Someone always says "ChatGPT can do this!", but then within one or two prompts, its gets it wrong.
It is equally possible to "reason" about things you already know as about things you've just been told. In fact, the capacity to speculatively, without prompting, attempt such reasoning is a big part of cognition.
From your comment it would seem that you are disputing jenever's popularity by saying jenever is more popular...
Perhaps it was a good faith mistake? If so, that would imply that the AI knows more about jenever than you?
For example, France has been trending towards beer more and more, and within a few decades they might be consuming more beer than wine. But even then, the French wouldn't slowly start to say beer represents France.
Furthermore, "just adding some herbs" does a large disservice to the flavor change of Berenburg. Jenever (aka jonge/unaged jenever) is straight-up vile. I've heard it described by expats as "having the worst elements of both cheap gin and cheap whisky".
Berenburg in comparison is spicy and vanilla-y and actually debatably enjoyable.
Aged/oude jenever is much closer to Berenburg (or Berenburg to aged jenever), also with hints of vanilla and spices.
But virtually no one except for dusty old men orders aged jenever. The kind ordered by far the most is jonge jenever, and then it's only in the sense of "haha let's drink this terrible thing" or "let's get shitfaced quick".
If o1 supposedly "oneshots every question", it should have been aware of these nuances instead of just confidently assigning jenever as 'the' spirit of the Dutch.
The question in the prompt comes off to me as a sort of qualitative determination rather than asking about pure factual information (is there an officially designated spirit). As such I don't think it can necessarily be right or wrong.
Anyway, I'm not sure what you'd expect. In terms of acquisition of knowledge, LLMs fundamentally rely on a written corpus. Their knowledge of information that is passed through casual spoken conversation is limited. Sure, as human beings, we rely a great deal on the latter. But for an LLM to lack access to that information means that it's going to miss out on cultural nuances that are not widely expressed in writing. Much in the same way that a human adult can live in a foreign country for decades, speaking their adopted language quite fluently, but if they don't have kids of their own, they might be quite ignorant of that country's nursery rhymes and children's games, simply because they were never part of their acquired vocabulary and experience.
I was just proving the people wrong that were saying akin to that o1 was "oneshotting every question".
I completely understand from how LLMs work that they wouldn't be able to get this right. But then people shouldn't be proudly pronouncing that o1 (or any model) gets every question right on the first try.
I have opened the question of why you thought jenever was not representative, and your non-responsiveness, I think, supports the conclusion that the AI was more correct in your contrived instance.
What's expected is an ability to identify trick questions, i.e., to recognize fundamental problems in the phrasing of a question rather than trying to provide a "helpful" answer at all costs.
This corresponds to one of the many reasons LLM output is banned on Stack Overflow.
But, like Zahlman points out, it's a trick question, and instead of admitting it doesn't know or even prepending "I don't know for sure, but:", it just burps up its best-effort answer. There is no one spirit that represents the Netherlands. If an LLM is so good it "oneshots any question", it should realize it doesn't have a unanimous answer and tell me.
Using just a plain old search engine, for things like "national drink of the netherlands" and simlar queries, I am directed to Wikipedia's Jenever page as the top hit, and Wikipedia's list of national drinks lists Jenever and Heineken as the entries for the Netherlands. Search engines also give page after page of travel guides and blog posts, most of which list Jenever at or near the top of of their listings. One travel guide calls it "the most famous Dutch spirit and most famous Amsterdam liquor, Jenever, also spelled Genever or simply Dutch gin."
When asking an LLM to write a script for you, I would say it completely fails 10 to 30% of the time: again, making up an API or just getting things straight-up wrong.
It's very helpful, especially when starting from zero with beginner questions, but it fails in many scenarios.
I got a very different answer:
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots because 10.01 pounds is greater than 9.99 pounds. The material doesn't matter in this case; weight is the deciding factor.
What model returned your answer?
Let me think about this step by step:
1. We're comparing two masses:
   * Steel ingots: 9.99 pounds
   * Cotton: 10.01 pounds
2. The cotton bag has a mass of 10.01 pounds, while the steel bag has a mass of 9.99 pounds
3. While steel is much denser than cotton (meaning it takes up less space per pound), this doesn't affect the weight - a pound is a pound regardless of material
4. 10.01 is greater than 9.99 by 0.02 pounds
Therefore, the 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The cotton may take up much more space, but when we're just comparing weights, density doesn't matter - only the mass does.
GPT 4:
The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The type of material doesn’t affect the weight comparison; it’s purely a matter of which bag weighs more on the scale.
GPT 4o:
The 10.01-pound bag of fluffy cotton is heavier. Weight is independent of the material, so the bag of cotton’s 10.01 pounds outweighs the steel ingots’ 9.99 pounds.
GPT o1:
Since both weights are measured on the same scale (pounds), the 10.01-pound bag of cotton is heavier than the 9.99-pound bag of steel, despite steel being denser. The key is simply that 10.01 pounds exceeds 9.99 pounds—density doesn’t affect the total weight in this comparison.
If a system has the capability to solve problems of the form {part_1, ..., part_n}, then it only has that capability if irrelevant permutations {part_1, part_2', ..., part_n} make no difference.
It's very obvious that such permutations can destroy such apparent capabilities.
Please leave the premises, sir. We don't take kindly to luddites here.
There are other websites where two and three word comments are better received.
Testing it with more variations until one fails feels a bit like p-hacking. You'd need to engage in actual statistics to get reliable results from that, beyond "If I really try, I can make it fail". Which would be a completely different hypothesis than the one presented at the start
LLMs get things wrong due to different factors than humans (humans lose focus, LLMs have randomness applied when sampling their responses to improve results). But clearly we have to choose a goal somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.
There are deeper philosophical questions about what reasoning actually _is_, and LLMs have made those sharper: they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning. The question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - that they're just generating words in specific orders - weighs heavily against them.
But more importantly, if you want to show that LLMs can't reason you obviously have to use a test that when applied to humans would show that humans can reason. Otherwise your test isn't testing reasoning but something more strict.
You can make mistakes and still reason. Very often people given the same premises will disagree in thier reasoning as we are doing right here.
Apple AI researchers released a paper on it. They say no.
People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.
> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.
According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").
In fact, in my use of o1 it's incredibly clear that it still has the same problems. It's incredibly common that the second I ask for something even slightly outside the training set, it's more likely to "round" to some wrong solution in the training set than to use any sort of human-like reasoning to figure out the right answer (often the right answer isn't hard to get, just not found in a Google search).
May still get it wrong in more subtle ways, though. Personally, I think it'll continue to get physics wrong until someone builds it some robot arms so it can train on actually interactive physical spaces and behavior.
Nothing capable of reasoning would contradict itself so blatantly and in such a short span while failing to indicate any kind of uncertainty.
And failure modes of other types of reasoners do not need to be the same as the failure modes of humans.
Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?
Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?
Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?
Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?
Which is heavier, a golf ball made of steel or a baseball made of lithium?
In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.
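One reason the Venus variant above is genuinely trickier: by weight (rather than mass) the answer flips. A quick check, assuming surface gravities of roughly 8.87 m/s² for Venus and 9.81 m/s² for Earth:
```
g_venus, g_earth = 8.87, 9.81          # m/s^2 (approximate surface gravities)

steel_on_venus = 10.01 * g_venus       # ≈ 88.8 N
feathers_on_earth = 9.99 * g_earth     # ≈ 98.0 N

print(steel_on_venus < feathers_on_earth)  # True: more mass, yet less weight than the feathers on Earth
```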
The proper way to ask it would be to ask which object has more mass.
Q:
Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible
A: Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:
Buoyancy Effect
Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.
As a result:
• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale, because the buoyant force acting on the polystyrene reduces its apparent weight more.
Actual Weight
Weight is the gravitational force on an object and is given by W = m·g, where m = 100 kg and g ≈ 9.81 m/s². Without considering buoyancy, both objects have the same weight of approximately 981 N (newtons).
Conclusion
• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to the reduced buoyant force acting on it compared to the polystyrene ball.
They very often get popular "tricky" questions wrong because they saw it so many times that they switch from internal reasoning to memorization/retrieval.
gpt-4o: https://chatgpt.com/share/67768221-6c60-8009-9988-671beadb5a...
o1-mini: https://chatgpt.com/share/67768231-6490-8009-89a6-f758f0116c...
o1: https://chatgpt.com/share/67768254-1280-8009-aac9-1a3b75ccb4...
"Which is heavier, a solid titanium box weighing 9.9 flubs, or an empty wooden box weighing 10.1 blobs, where 1 flub = 1 kg, and 1 blob is 1kg".
The answer: "Since 1 flub = 1 kg and 1 blob = 1 kg, the titanium box’s mass is 9.9 kg and the wooden box’s mass is 10.1 kg. Therefore, the wooden box (10.1 kg) is heavier."
Thought that was pretty impressive.
> The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Interestingly, that was the best answer out of all models and also a correct one. O1-preview doesn't allow setting the temperature, so it's hard to know where the real probabilities fall.
1: "The 10.01-pound bag of fluffy cotton would be heavier. Even though cotton is much less dense than steel, the weight of the cotton bag is 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag the heavier of the two by 0.02 pounds."
2: "The 10.01-pound bag of fluffy cotton is heavier. Even though the steel ingots are denser and more compact, the cotton weighs more at 10.01 pounds compared to the steel ingots' 9.99 pounds. The difference is only 0.02 pounds, but in terms of weight, the cotton bag is heavier."
3: "The 10.01-pound bag of fluffy cotton is heavier. Despite cotton being less dense than steel, the weight is determined by the total mass, not the material's density. So, since the cotton bag weighs 10.01 pounds and the steel ingot bag weighs 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
4: "The 9.99-pound bag of steel ingots is heavier. It weighs 9.99 pounds, while the bag of fluffy cotton weighs 10.01 pounds. Despite the cotton being fluffier and taking up more space, the steel ingots are still lighter by 0.02 pounds."
To determine which bag is heavier when measured in air—the 9.99-pound bag of steel ingots or the 10.01-pound bag of fluffy cotton—we need to consider not just their masses but also the effect of air buoyancy on their apparent weights.
*Step 1: Calculate the Masses*
First, convert the given weights into masses using the conversion \(1 \text{ pound} = 0.453592 \text{ kg}\):
- Mass of steel bag, \(m_s = 9.99 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5310 \, \text{kg}\)
- Mass of cotton bag, \(m_c = 10.01 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5401 \, \text{kg}\)
*Step 2: Calculate the Volumes*
Next, calculate their volumes using the densities:
- Density of steel, \(\rho_s \approx 8000 \, \text{kg/m}^3\)
- Density of cotton (fluffy), \(\rho_c \approx 50 \, \text{kg/m}^3\)
- Volume of steel bag, \(V_s = \frac{m_s}{\rho_s} \approx \frac{4.5310 \, \text{kg}}{8000 \, \text{kg/m}^3} \approx 5.664 \times 10^{-4} \, \text{m}^3\)
- Volume of cotton bag, \(V_c = \frac{m_c}{\rho_c} \approx \frac{4.5401 \, \text{kg}}{50 \, \text{kg/m}^3} \approx 0.090802 \, \text{m}^3\)
*Step 3: Calculate the Buoyant Forces*
Using the density of air \(\rho_{\text{air}} \approx 1.2 \, \text{kg/m}^3\) and acceleration due to gravity \(g = 9.81 \, \text{m/s}^2\):
- Buoyant force on steel bag, \(B_s = \rho_{\text{air}} \times V_s \times g \approx 1.2 \, \text{kg/m}^3 \times 5.664 \times 10^{-4} \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 0.00668 \, \text{N}\)
- Buoyant force on cotton bag, \(B_c = \rho_{\text{air}} \times V_c \times g \approx 1.2 \, \text{kg/m}^3 \times 0.090802 \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 1.068 \, \text{N}\)
*Step 4: Calculate the Apparent Weights in Air*
Subtract the buoyant force from the actual gravitational force (mass times gravity):
- Apparent weight of steel bag, \(W_s' = m_s \times g - B_s \approx 4.5310 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 0.00668 \, \text{N} \approx 44.44 \, \text{N}\)
- Apparent weight of cotton bag, \(W_c' = m_c \times g - B_c \approx 4.5401 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 1.068 \, \text{N} \approx 43.47 \, \text{N}\)
*Conclusion:*
Despite the cotton bag having slightly more mass and a greater weight in a vacuum, when measured in air, the steel bag is heavier due to the significantly smaller upward buoyant force acting on it compared to the cotton bag. This means that on a scale in air, the 9.99-pound bag of steel ingots will weigh more than the 10.01-pound bag of fluffy cotton.
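The arithmetic in that answer checks out; here is a minimal sketch reproducing it with the same assumed densities (8000 kg/m³ for steel, 50 kg/m³ for fluffy cotton, 1.2 kg/m³ for air):
```
LB_TO_KG = 0.453592
g = 9.81            # m/s^2
rho_air = 1.2       # kg/m^3

def apparent_weight_n(pounds, density):
    """Weight in air (newtons): gravitational force minus the buoyant force."""
    mass = pounds * LB_TO_KG
    volume = mass / density
    return mass * g - rho_air * volume * g

print(round(apparent_weight_n(9.99, 8000.0), 1))   # ≈ 44.4 N for the steel bag
print(round(apparent_weight_n(10.01, 50.0), 1))    # ≈ 43.5 N for the cotton bag
```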
prompt> Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton? Please state in your answer what's the difference in grams.
answer> The 10.01-pound bag of cotton is heavier. The difference is 0.02 pounds, which is roughly 9.07 grams (using 1 pound ≈ 453.59237 grams).
Reference • National Institute of Standards and Technology (NIST): Conversion Factors
Notably, it's not even necessarily 4o, their premier "non-reasoning" model, but likely the cheaper variant: with a free account, the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
Without a ChatGPT subscription you can't select a specific model anymore, not even rate limited, as was previously possible.
It feels a bit like "if all you have is a hammer, everything looks like a nail", where we're trying to make LLMs do stuff which it isn't really designed to do.
Why don't we just limit LLMs to be an interface to use other tools (in a much more human way) and train them to be excellent at using tools. It would also make them more energy efficient.
But it's OK if we currently try to make them do as much as possible, not only to check where the limits are, but also to gain experience in developing them and for other reasons. We just shouldn't expect them to be really intelligent.
ChatGPT will switch to Python for some arithmetic, with the result that you get floating-point math issues where an 8-year-old would get the result right. I think "switch to a tool" still requires understanding which tool to use to get a reliable result, which in turn means understanding the problem. It's an interesting issue.
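A familiar example of the kind of surprise that detour can produce:
```
# Naive float arithmetic vs. what an 8-year-old would write down:
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False

# Exact decimal arithmetic avoids the issue:
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```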
Interestingly, the variation of this problem that I first encountered, personally, was "which weighs more, a pound of feathers or a pound of gold?"
This is a much more difficult question. The answer given to me was that the pound of feathers weighs more, because gold is measured in troy weight, and a troy pound consists of only 12 ounces compared to the 16 ounces in a pound avoirdupois.
And that's all true. Gold is measured in troy weight, feathers aren't, a troy pound consists of only 12 ounces, a pound avoirdupois consists of 16, and a pound avoirdupois weighs more than a troy pound does.
The problem with this answer is that it's not complete; it's just a coincidence that the ultimate result ("the feathers are heavier") is correct. Just as a pound avoirdupois weighs more than a troy pound, an ounce avoirdupois weighs less than a troy ounce. But this difference, even though it goes in the opposite direction, isn't enough to outweigh the difference between 16 vs 12 ounces per pound.
Without acknowledging the difference in the ounces, the official answer to the riddle is just as wrong as the naive answer is.
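The numbers behind that, using the standard definitions (1 troy oz = 31.1034768 g, 1 avoirdupois oz = 28.349523125 g):
```
TROY_OZ_G = 31.1034768        # grams per troy ounce
AVDP_OZ_G = 28.349523125      # grams per avoirdupois ounce

troy_pound_g = 12 * TROY_OZ_G     # ≈ 373.24 g  (a pound of gold)
avdp_pound_g = 16 * AVDP_OZ_G     # ≈ 453.59 g  (a pound of feathers)

print(TROY_OZ_G > AVDP_OZ_G)          # True: the troy *ounce* is heavier
print(avdp_pound_g > troy_pound_g)    # True: yet the avoirdupois *pound* still wins
```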
No idea if GP did it intentionally to further drift from training data, but steel doesn't count as a precious metal, so it messes up the riddle by putting the two weights in the same system.
...Have you really never encountered people who would reflexively say that?
1. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. Despite the difference in material density, the key factor here is the weight itself, with 10.01 pounds being greater than 9.99 pounds, regardless of the substance.
2. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The weight of an object is directly determined by its mass, and in this case, the cotton bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag heavier by 0.02 pounds.
3. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots, even though the cotton is much less dense. The comparison is purely based on weight, and 10.01 pounds is greater than 9.99 pounds.
4. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. While steel is much denser than cotton, the weight is the determining factor here, and 10.01 pounds is slightly more than 9.99 pounds. So, the cotton bag weighs more.
5. The 10.01-pound bag of fluffy cotton is heavier. Despite the cotton being less dense and bulkier, the weight of the cotton is still 10.01 pounds, which is greater than the 9.99-pound bag of steel ingots.
https://chatgpt.com/share/67759d1a-1430-800b-a0a9-2c5f2ac02a...
An ounce of gold is heavier than an ounce of feathers, because the "ounce of gold" is a troy ounce, and the "ounce of feathers" is an avoirdupois ounce. But that shouldn't be true between feathers and bricks - they're both avoirdupois.
So using the word pound without qualification is ambiguous in contexts where it’s not clear whether mass or force is meant.
https://chatgpt.com/share/67759723-f008-800e-b0f3-9c81e656d6...
One might argue that it's impossible to compress air using known engineering, but that would be a different kind of answer.
I mean that ChatGPT had no questions about the container of the gas (does it have weight? is it tared out?) or about buoyancy. And it’s really rather sad that ChatGPT calculated the volume of air at atmospheric temperature and pressure and didn’t notice that it was nowhere near fitting into its supposed container. (At least 1.01lb of liquid nitrogen would fit.)
“Which is heavier: a bag with 9.9 pounds of fluff or a bag with 9.11 pounds of steel block?”
Try swapping the numbers and the materials, run each test at least 10x times.
* What’s heavier? 9.99 kg of super dense lead or 10 kg fluffy feathers?*
Let me think about this step by step:
1. We're comparing two masses:
   * 9.99 kg of lead
   * 10 kg of feathers
2. The density of the materials doesn't affect their mass:
   * While lead is indeed much denser than feathers
   * A kilogram is a unit of mass, not volume
   * 1 kg of any material has the same mass as 1 kg of any other material
3. We can directly compare the numbers: * 10 kg > 9.99 kg
Therefore, 10 kg of feathers is heavier than 9.99 kg of lead. The feathers would take up much more space (volume) due to being less dense, but they have a greater mass.
Q: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?
C: Let me think about this step by step:
A woman and her son are in an accident
The woman (the mother) dies
The boy is taken to hospital
A doctor says they can't operate because the boy is their son
But the mother is dead...
Ah! The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father. Alternatively, the doctor could be the boy's other mother in a same-sex relationship, though this wasn't the original intent of the traditional riddle.
More substantively: "This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father." is totally nonsensical.
The text is a gender (and meaning) inversion of the classic riddle to confuse LLMs. Even though Claude correctly CoTs that the doctor is the father, it seemingly then copypastas the explanation to the original riddle but swaps the gender, completely missing the point.
This is not code. Forgetting a semicolon will not make the output break. It thinks "maybe they wrote he instead of she" and then gives options for both situations.
It is meant to solve real world situations where people might not type properly, it is not a word problem solving machine.
It identified the pattern, but it didn't understand the father is a man while the mother is a woman. If people assume doctors must be men, they wouldn't consider that the doctor could be the mother.
(Side note: this is an antiquated riddle/joke ... for me in particular because my mother is a doctor, but in general around half of new doctors are female, and think about Grey's Anatomy and other shows.)
"The doctor is the boy’s other parent—specifically his mother, who wasn’t in the accident."
``` I have two bags, one can hold a pound of gold and one can hold a pound of feathers. Which bag is heavier? ```
The point here: a) the question really is a bit too vague; b) if you assume that each bag is made of the same material and that each bag is just big enough to hold its contents, the bag for the feathers will need to be much bigger than the bag for the gold and thus would weigh more.
The first response was wrong because the model completely missed that the question was about the bags themselves:
``` They weigh the same.
A pound of any substance, regardless of its density or volume, weighs one pound. Gold is much denser than feathers, meaning a pound of gold occupies a much smaller volume. Feathers are very light and fluffy, so a pound of feathers takes up a much larger volume. ```
Asking a followup `But which bag is heavier?` resulted in recognition that this is a trick question, and an incorrect answer:
``` This is a classic trick question!
Even though the bags hold the same weight (one pound), the bag of gold is likely to be heavier.
Weight vs. Mass:
Weight is the force of gravity acting on an object. Mass is the amount of matter in an object. The Bag Itself:
The bag holding the gold, even if it's a lightweight material, will have some weight. The bag holding the feathers, while it might seem negligible, will also have some weight. Practical Considerations:
In reality, the bag holding the gold might be significantly heavier due to the weight of the bag itself. Therefore, while the gold and feathers have the same mass (one pound), the bag containing the gold is likely to be heavier in a practical sense. ```
One could go further into the linguistic nuance of saying "this can hold one pound of [substance]", which often implies that that's its maximum carrying capacity; this would actually make the "trick question" answer all the more correct, as a bag that is on the cusp of ripping when holding one pound of feathers would almost certainly rip when holding one pound of (much denser) gold.
The bag for the gold doesn't need anything substantial.
So, no, the chances are strongly on the side of the feather bag being heavier.
Also, if we model the mental state of the LLM as a frazzled retail worker dealing with thousands of customers per second, the rote response is reasonable. As a dev, sometimes I get annoyed at QA for a hyper-narrow "trap" test case.
Step-by-Step Reasoning
1. Clarify what is being compared. The question asks us to compare two bags: one containing steel ingots weighing 9.99 pounds and another containing fluffy cotton weighing 10.01 pounds.
2. Check the actual weight. Despite our intuition that steel is "heavier" than cotton by density, the problem specifically states the total weight in pounds of each bag. Since 9.99 < 10.01, the 10.01-pound bag of cotton is heavier in total mass than the 9.99-pound bag of steel.
3. No trick in the measurement. Each weight is given in the same units (pounds). This is crucial because if different units or measurement standards were used (e.g., 9.99 kilograms versus 10.01 pounds), we would need to convert them to the same unit to compare. But here, both are simply in pounds.
4. Conclusion. The bag of cotton weighing 10.01 pounds is heavier than the bag of steel weighing 9.99 pounds, even though steel is denser than cotton. The numeric value of the total weight is the deciding factor.
Answer
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots, because 10.01 pounds is strictly greater than 9.99 pounds.
Reference • Basic arithmetic: 10.01 is greater than 9.99. • For a playful twist on a similar concept, see any version of the riddle “What weighs more—a pound of feathers or a pound of lead?” In that classic riddle, both weigh the same; here, the numbers differ.
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots. Despite the significant difference in density and volume between steel and cotton, the weights provided clearly indicate that the cotton bag has a greater mass.
Summary:
Steel ingots: 9.99 pounds
Fluffy cotton: 10.01 pounds
Conclusion: The 10.01-pound bag of cotton is heavier.
"To determine which is heavier, we need to compare the weights of the two objects.
The steel ingots weigh 9.99 pounds, and the fluffy cotton weighs 10.01 pounds.
Since 10.01 pounds is greater than 9.99 pounds, the bag of fluffy cotton is heavier than the bag of steel ingots, despite the cotton being "fluffy" and the steel being dense. Weight is a measure of the amount of mass in an object, and in this case, the cotton has a slightly greater mass than the steel."
-> A new benchmark emerges with a novel evaluation method.
-> A new model saturates the benchmark by acquiring the novel “behavior.”
-> A new benchmark introduces yet another layer of novelty.
-> Models initially fail until a lab discovers how to acquire the new behavior.
Case in point: OpenAI tackled this last step by introducing a paradigm called deliberative alignment to tackle some of the ARC benchmarks. [1]
Alongside all this technical iteration, there’s a parallel cycle of product iteration, aiming to generate $ by selling intelligent software. The trillion $ questions are around finding the right iterations on both technical and product dimensions.
Also, the correct title of this contribution is: Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning
99% of studies claiming some out-of-distribution failure of an LLM use a model already made irrelevant by the SOTA. These kinds of studies, with long turnaround and review periods, are not the best format for making salient points, given the speed at which the SOTA horizon progresses.
There are 6 “a”s in the sentence: “How many ‘a’ in this sentence?”
https://chatgpt.com/share/677582a9-45fc-8003-8114-edd2e6efa2...
Whereas the typical “strawberry” variant is now correct.
There are 3 “r”s in the word “strawberry.”
Clearly the lesson wasn’t learned, the model was just trained on people highlighting this failure case.
Hence we continually played whack-a-mole with bugs: we would squash one bug, and another one would appear.
Same with LLMs: squash one problem with a data fix, and another one pops up.
For example, the output for "how many x's are there in xaaax" is 3.
https://chatgpt.com/share/677591fe-aa58-800e-9e7a-81870387be...
https://chatgpt.com/share/6775cb37-4198-8007-82cb-e897220827...
https://chatgpt.com/share/6775c9a6-8cec-8007-b709-3431e7a2b2...
Basically, one feed-forward pass is not Turing complete, but autoregressive generation (feeding previous output back into the model) is Turing complete.
Regardless, I’d love if you would explain a bit more why the transformer internals make this problem so difficult?
https://arxiv.org/html/2407.15160v2
The Expressive Power of Transformers with Chain of Thought
https://arxiv.org/html/2310.07923v5
A transformer needs to retrieve the letters for each token while its internal representation stays aligned in length with the base tokens (each token has a finite embedding, yet is made up of multiple letters), and then it has to count the letters within that misaligned representation.
Autoregressive mode largely alleviates the problem, since the model can align its internal representation with the letters and keep an explicit sequential count.
BTW - humans also can't count without resorting to a sequential process.
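A small illustration of the misalignment (assumes the tiktoken package; the exact split may vary by tokenizer):
```
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "strawberry"
pieces = [enc.decode([t]) for t in enc.encode(word)]
print(pieces)                              # e.g. ['str', 'aw', 'berry'] -- tokens, not letters
print(sum(p.count("r") for p in pieces))   # 3: counting requires looking inside every token
```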
I think a fair conclusion here is reasoning models are still really good at solving very difficult math and competitive programming problems, but just better at ones they have seen before.
"First break the problem down into known facts, then pull relevant world knowledge, then bring it all together to assess the problem from multiple angles and make a conclusion. Do not immediately just use the first obvious conclusion."
You're gonna get a lot better responses. I suspect this is more of a "look! LLMs make bad kneejerk responses when we try to trick them from what they were expecting!" rather than "Look! They aren't even smart reasoners, they can't even figure out these problems without memorizing!"
They do memorize. But that cuts both ways - making problems very close to the memorized one mess with their perception, the same way humans will instinctually respond to something that looks like a face before stepping back and assessing.
Q: I was heading to Aberdeen from London. On my way I passed seven wives, each wife had seven sacks, and in each sack there were seven cats and each cat had seven fish. How many were going to London?
A: This riddle is a play on words, and the answer is hidden in the phrasing! You mentioned you were heading to Aberdeen from London, but you didn’t say anything about the seven wives, their sacks, cats, or fish actually being headed to London.
The only one going to London is you.
So the answer is: 1 person (you) are going to London.
> I was heading to Boston from New York. On my way I passed six clerks, each clerk had six socks, and in each sock there were six skinks and each skink had six sticks. How many were going to Boston?
The answer is... "I" was going to Boston! The question asks how many were going to Boston, and the answer is the person being referred to, which is "I". The rest of the information about the clerks, socks, skinks, and sticks is just a clever distraction!
[Of course, it also got your exact prompt right.]