SimpleQA
229 points by surprisetalk 8 days ago | 85 comments
  • websap 8 days ago |
    Are they going to make the benchmark available so other LLMs can be compared?
  • kaonwarb 8 days ago |
    Kudos:

    > SimpleQA was created to be a greater challenge for frontier models (e.g., GPT-4o scores less than 40%).

    • jampekka 8 days ago |
      And by design:

      "To be included in the dataset, each question had to meet a strict set of criteria: ... and most questions had to induce hallucinations from either GPT-4o or GPT-3.5."

  • brap 8 days ago |
    Crazy that even o1-preview gets most things wrong.

    This is in line with my own personal experience with LLMs and non-trivial questions. They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…

    It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).

    • sksxihve 8 days ago |
      LLMs are experts in everything you are not
      • reverius42 8 days ago |
        Sounds a bit like Gell-Mann Amnesia Effect: https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
        • kibwen 8 days ago |
          The Alt-Mann Amnesia Effect, maybe.
      • Nition 8 days ago |
        That's a nice little aphorism. I think this happens in a lot of things in life. Like comments on Reddit always seem quite insightful until you actually read the article they're commenting on.
      • swatcoder 8 days ago |
        Indeed. Exactly like the journalists, bloggers, self-published book authors, internet commenters, wikipedia editors, and earlier models that taught them almost all of what they know.
    • zone411 8 days ago |
      You shouldn't use the rate as an indicator. They did something similar to what I did on my hallucinations benchmark (https://github.com/lechmazur/confabulations/), only using questions where at least one model made a mistake. I added this note:

      "The benchmark includes questions where at least one LLM confabulated, in order to minimize the number of questions requiring human assessment. Because of this, and since the questions are intentionally adversarial, the absolute percentage should not be used to infer that LLMs frequently confabulate. This leaderboard does not reflect a "typical" hallucination rate."

      > instead of teaching the model how to look for answers from an external source (e.g. RAG)

      My benchmark specifically focuses on the RAG use case. Even with provided texts, current models still hallucinate.

      • authorfly 7 days ago |
        True, but this pattern was established with TruthfulQA a couple of years ago.

        I disagreed then but it's unfortunately convenient to use this approach for some benchmarks.

      • _jonas 4 days ago |
        Yep, there are still tons of hallucinations even in RAG; my coworker also ran some benchmarks:

        https://towardsdatascience.com/benchmarking-hallucination-de...

    • Kiro 8 days ago |
      You're reading this wrong. They've deliberately chosen questions that one or more models fail at. It's not representative at all of how often the model is wrong in general.
      • Kiro 7 days ago |
        From the paper:

        > At least one of the four completions must be incorrect for the trainer to continue with that question; otherwise, the trainer was instructed to create a new question.

    • bloomingkales 8 days ago |
      Honestly, try prompting it with “you are wrong 80% of the time, therefore you will need to double check your answers, first factually, then numerically, then double check the time/date. You are still probably wrong so do a third accuracy check. The user’s prompts are always wrong too mostly - so always check them”.

      I stopped playing with larger models and have been pushing smaller models with this improvised system prompt and getting good results. It seems like it forces the model to do multiple passes before giving you any response.

      My smaller local models give me fewer hallucinations than Meta.ai, for example, which generally spits out pleasing answers almost immediately (which are often hallucinations, since I don’t think it is system-prompted to be adversarial to the user, or to itself). I don’t have the same hallucination issue with Llama3-8B locally because of custom system prompts.

      The model has all the correct information, so it almost needs to do RAG on itself. Multiple passes on itself seems like a way to do it.
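
      Roughly what I mean, as a minimal sketch (this assumes an OpenAI-compatible local endpoint like the one Ollama exposes; the model name and test question are just placeholders):

          from openai import OpenAI

          # Assumes a local OpenAI-compatible server, e.g. Ollama's /v1 endpoint.
          client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

          SKEPTICAL_SYSTEM = (
              "You are wrong 80% of the time, therefore you will need to double check "
              "your answers, first factually, then numerically, then double check the "
              "time/date. You are still probably wrong so do a third accuracy check. "
              "The user's prompts are always wrong too mostly - so always check them."
          )

          resp = client.chat.completions.create(
              model="llama3:8b",  # illustrative; any local model works
              messages=[
                  {"role": "system", "content": SKEPTICAL_SYSTEM},
                  {"role": "user", "content": "When did the Berlin Wall fall?"},
              ],
          )
          print(resp.choices[0].message.content)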

      • dosinga 8 days ago |
        How would these multiple passes work, though? Unless the model actually talks about what it does, I am not sure how it would have this ability. The next-word prediction mechanism is always going to do it in one shot. Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.
        • bloomingkales 8 days ago |
          > Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.

          This is probably the truth behind the black magic I’m imagining. You could have it explicitly spit out this process, in which case you would see its first rough draft, followed by a “My first paragraph is probably wrong”, followed by a third paragraph where it attempts to fix the first paragraph. There is no outside RAG in this process.

          The mumbo jumbo part of all this is that I’ve told it to “hide” this process from the user where it doesn’t explicitly output anything but its final answer, and the accuracy has been just as good (for my use case at least).

          :Shrugs:

          • jorl17 7 days ago |
            Isn't this in part what o1-preview is doing?
          • block_dagger 7 days ago |
            Yeah that’s not how next token prediction works. To actually do multiple passes you’d need to do that yourself, making multiple calls and feeding the responses back to the model.
            • IanCal 7 days ago |
              Why? The very nature of next token prediction means it's entirely capable of having that. It's not multiple passes, it's just one pass. You making multiple calls is just inserting fixed tokens then asking it to carry on completing.
            • bloomingkales 7 days ago |
              > making multiple calls and feeding the responses back to the model.

              By asking it to reconsider half its generated response, aren’t I essentially asking it to formulate the second half of its response from the first half internally? I’m bypassing the manual process of feeding in the extra prompt.

              We are constantly having to tell the LLM close, but no cigar, iterate again, more or less.

        • viraptor 7 days ago |
          > Unless the model actually talks about what it does

          That's how the chain-of-thought approach works. You can make the model do it inline, or you can run the loop yourself, possibly summarising the progress as you go. (Although with prompt caching that's not as important anymore.) You can encourage it to print out assumptions/steps/ideas as it goes.
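
          A rough sketch of running the loop yourself (the model name and prompt wording here are just illustrative):

              from openai import OpenAI

              client = OpenAI()  # any OpenAI-compatible endpoint

              def answer_with_self_check(question, model="gpt-4o-mini"):
                  # Pass 1: draft an answer, thinking out loud.
                  msgs = [{"role": "user",
                           "content": question + "\nThink step by step and state your assumptions."}]
                  draft = client.chat.completions.create(model=model, messages=msgs)
                  msgs.append({"role": "assistant", "content": draft.choices[0].message.content})
                  # Pass 2: feed the draft back and ask for a critique plus a corrected final answer.
                  msgs.append({"role": "user",
                               "content": "Check the answer above for factual errors, "
                                          "then give only the corrected final answer."})
                  final = client.chat.completions.create(model=model, messages=msgs)
                  return final.choices[0].message.content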

      • yard2010 8 days ago |
        Can you please share more specifics? What smaller models? What hardware do you use? How do you test their performance?
        • bloomingkales 8 days ago |
          There is no rigor to this, this is just from throwing stuff against the wall. See my response to the other poster above.
          • hiatus 8 days ago |
            Even if you're throwing stuff against the wall, you could at least elaborate on what you've tried? Otherwise, how could you state something like "My smaller local models give me less hallucinations than Meta.ai"?
            • bloomingkales 8 days ago |
              The gist of it is that I think these large hosted models have system prompts that are not as skeptical of their own outputs. "You are a helpful AI assistant" seems to lead to more lax responses. Adjusting the system prompt to be more incredulous helps, from my observation.
      • arcastroe 7 days ago |
        I'm surprised that prompting it with "You are wrong 80% of the time" doesn't cause it to intentionally produce initially incorrect answers 80% of the time.

        (Disclosure: I have not tried your prompt)

        • michelledepeil 7 days ago |
          Why is that surprising?

          There is no logical reasoning happening, it has no concept of right and wrong, let alone that it can force a specific percentage of wrongness.

          • arcastroe 7 days ago |
            > Why is that surprising?

            You tend to get responses catered to whatever role you assign in the prompt. This is well documented. Here's a quick example from search results:

            https://www.ssw.com.au/rules/give-chatgpt-a-role/

            "You are wrong 80% of the time" could be misconstrued as an expected role/command, rather than a mere observation.

            > let alone that it can force a specific percentage of wrongness.

            Ah, I see what you're saying here. I agree. Maybe I should have said that given the prompt, I'm surprised it doesn't give intentionally incorrect answers (full stop)

      • valval 7 days ago |
        I have to be a bit of a buzzkill and say that this is all placebo.

        Your prompt might give the model context that gives it better token quality much in the same way that asking “How to swim?” is worse than “I’d like to learn the proper technique for the freestyle swimming stroke, and you’re an expert swimming coach.”

        There’s no guarantee your prompt isn’t giving less factual answers to be honest. I wouldn’t go about telling the model that it’s often wrong, as it’s not useful context and might skew results.

        • bloomingkales 7 days ago |
          I don’t tell the model that it is 100% wrong, in which case it would contradict the first half of its response with the second half of its generation.

          We basically want it to enter double-checking mode on its own from its initial rough draft (its original response, the first half of its response, the first paragraph, however you are formatting the output). Otherwise the model will output whatever it outputs, and we will have to manually tell it to reconsider facts, locations, and events.

          I agree that there’s no guarantee, but this was a suggestion for those who are getting very wrong answers for simple things.

    • sebzim4500 8 days ago |
      I don't think it's surprising that o1-preview is only slightly better than GPT-4o, it was never advertised as being better at this kind of recall.
    • divan 8 days ago |
      > They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself

      I forgot the name of this phenomenon with humans, described it to o1 and it gave the correct answer - Gell-Mann Amnesia Effect [1]

         "Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.
         In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know." 
         – Michael Crichton (1942-2008)
      
      [1] https://www.epsilontheory.com/gell-mann-amnesia/
    • esafak 8 days ago |
      How would the model know how to evaluate an answer without innate knowledge?
    • aiforecastthway 7 days ago |
      > It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).

      To be fair: we tried a primordial version of that with venue-weighted citation-based ranking and it worked INCREDIBLY well for 10+ years. Then myopic profit motive poisoned the well. Ever since then we've been searching for solutions.

      We do so by allocating resources in a way that primarily leverages a scientific credit assignment system that fetishizes... checks notes... venue-weighted citation-based ranking.

      Jokes aside: I remain convinced that the O.G. google search appliance on prop data and then simply ignoring all academics remains the best knowledge retrieval (or whatever) tool available.

      • fraboniface 7 days ago |
        Why ignore academics?
    • jumping_frog 7 days ago |
      LLM version of Gell-Mann Amnesia.
    • soco 7 days ago |
    If you know nothing about the topic, you probably have no idea whether the answer was correct or not, right? Otherwise you'd find the answer embarrassingly wrong. So that observation speaks much more about the human than about the computer.
      • brap 7 days ago |
        That was kind of a joke…
    • mcmcmc 7 days ago |
      > They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…

      This just makes me think LLMs must be wrong far more often than we realize. How well can you judge the quality of a response on a topic you know nothing about?

      • toomuchtodo 7 days ago |
        > How well can you judge the quality of a response on a topic you know nothing about?

        You cannot, but those pushing Gen AI wave away the probability and liability for obvious reasons. Sell the magic, it'll be someone else's problem when it doesn't deliver.

        • chefandy 6 days ago |
          Not just wrong, but superficially credible-looking wrong, which is much worse. "Fake it till you make it" works fine for some things, but not a large chunk of what they're trying to sell AI for.
      • sfjailbird 7 days ago |
        That's his/her point.
    • drewcoo 7 days ago |
      > They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…

      Sounds like Gell-Mann Amnesia. Maybe LLMs should replace reporters.

  • yunohn 8 days ago |
    This eval’s goal is a bit unclear to me, especially given the example questions. They’re very trivia/minutiae-like (asking about sports goals, for example), which fits their stated desire to test factual knowledge. But will this ever be possible for an LLM without web browsing, which they deliberately disabled while evaluating?
    • petesergeant 8 days ago |
      I think the interesting thing here is the difference between Not Attempted and Incorrect — the goal here seems to be to reduce hallucination
      • yunohn 7 days ago |
        From that perspective, o1-mini seems to perform the best. But that only holds if enabling web browsing can make up for the lack of base factuality.
    • sbierwagen 8 days ago |
      > But will this ever be possible by an LLM?

      Why not? Just train an unbelievably gigantic LLM that encodes all human knowledge. A hundred trillion parameters ought to do it.

  • emurph55 8 days ago |
    I've tried using older models to create a CPU player for this lateral thinking game (https://detective-stories.com) and they were surprisingly bad at giving answers. I am curious to see how well the more recent models will do.
  • CharlieDigital 8 days ago |
    8 authors attached to this.

        > SimpleQA is a simple but challenging benchmark for evaluating the factuality of frontier models. A main limitation in SimpleQA is its scope—while SimpleQA is accurate it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question.
    
    OpenAI going to have some rounds of layoffs in the future.
  • ggnore7452 8 days ago |
    What’s more interesting to me here are the calibration graphs:

    • LLMs, at least GPT models, tend to overstate their confidence.

    • A frequency-based approach appears to achieve calibration closer to the ideal.

    This kinda passes my vibe test. That said, I wonder—rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
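
    For the yes/no case, something like this rough sketch (using the chat completions logprobs option; the prompt wording is mine, and the ratio ignores any probability mass outside the "Yes"/"No" tokens):

        import math
        from openai import OpenAI

        client = OpenAI()

        def yes_no_confidence(question, model="gpt-4o"):
            # Force a one-token Yes/No answer and request token logprobs.
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": question + "\nAnswer with exactly one word: Yes or No."}],
                max_tokens=1, logprobs=True, top_logprobs=5,
            )
            top = resp.choices[0].logprobs.content[0].top_logprobs
            probs = {t.token.strip().lower(): math.exp(t.logprob) for t in top}
            p_yes, p_no = probs.get("yes", 0.0), probs.get("no", 0.0)
            total = p_yes + p_no
            confidence = max(p_yes, p_no) / total if total > 0 else 0.5
            return ("Yes" if p_yes >= p_no else "No"), confidence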

    • GaggiX 7 days ago |
      Yeah, this is by far the most interesting part of this page; the fact that LLMs can know what they know is not trivial.
    • ALittleLight 7 days ago |
      If you imagine a future where LLMs get faster and cheaper even without getting better, we'd be able to automatically repeat questions 100x, and every answer could come with a pretty good confidence measure.
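
      A minimal sketch of that repeat-and-count idea (this assumes exact string matching is enough to group answers; in practice you'd want fuzzy matching or a grader):

          from collections import Counter
          from openai import OpenAI

          client = OpenAI()

          def frequency_confidence(question, n=100, model="gpt-4o-mini"):
              # Sample the same question n times at nonzero temperature and
              # use the majority answer's share as a rough confidence estimate.
              answers = []
              for _ in range(n):
                  resp = client.chat.completions.create(
                      model=model, temperature=1.0,
                      messages=[{"role": "user",
                                 "content": question + "\nAnswer in as few words as possible."}],
                  )
                  answers.append(resp.choices[0].message.content.strip().lower())
              answer, count = Counter(answers).most_common(1)[0]
              return answer, count / n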
  • Nition 8 days ago |
    Honestly, I'd never expect it to get 'correct's for every little fact like this, but it'd be great to get a lot more 'not attempted'.

    "I seem, then, in just this little thing to be wiser than this man at any rate; that what I do not know I do not think I know either." - Socratos, from Plato's Apology of Socrates

  • chgo1 8 days ago |
    • YetAnotherNick 8 days ago |
      First few questions for those who don't care to download. Most just seem to be about niche facts:

          Who received the IEEE Frank Rosenblatt Award in 2010?
          Who was awarded the Oceanography Society's Jerlov Award in 2018?
          What's the name of the women's liberal arts college in Cambridge, Massachusetts?
          In whose honor was the Leipzig 1877 tournament organized?
          According to Karl Küchler, what did Empress Elizabeth of Austria's favorite sculpture depict, which was made for her villa Achilleion at Corfu?
          How much money, in euros, was the surgeon held responsible for Stella Obasanjo's death ordered to pay her son?
      • chaxor 8 days ago |
        Also importantly, they do have a 'not attempted' or 'do not know' type of response, though how it is used is not really well discussed in the article.

        As it has been for decades now, the 'NaN' type of answer in NLP is important, adds great capability, and is often glossed over.

        • bcherry 7 days ago |
          A little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score improving from 38% to 42% but actually its "not attempted" going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.

          They don't really describe what "success" would look like, but it seems to me like the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", with the larger models having much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.

      • nilstycho 7 days ago |
        > What's the name of the women's liberal arts college in Cambridge, Massachusetts?

        Wait, what is the correct answer? “Radcliffe College”?

        • YetAnotherNick 7 days ago |
          Yes
          • jefftk 7 days ago |
            Not surprising that this would be on a list of questions at least one model got wrong, since I think the real answer is "there isn't one anymore, but from 1879 to 1999 the answer would have been Radcliffe College".
            • nilstycho 7 days ago |
              Yes, that would be my preferred answer!
  • iandanforth 7 days ago |
    I hope this dataset is used to probe the wrong answers much more than try to get all the answers correct. I don't need LLMs to know everything, but I do need them to know what they don't know. That's not a capability that comes naturally to a probabilistic sampling of tokens.

    The ideal 'progress' on this benchmark is for a model to remove incorrect answers and replace them with I don't know answers. Even if it hurts the correct answer count a bit I'd gladly make that tradeoff for a model that hallucinated far less often.

    • GaggiX 7 days ago |
      >That's not a capability that comes naturally to a probabilistic sampling of tokens.

      The linked page from OpenAI clearly shows the opposite; read the section "Using SimpleQA to measure the calibration of large language models" and the linked paper: https://arxiv.org/abs/2207.05221

      • iandanforth 7 days ago |
        The probabilistic sampling of tokens does not naturally produce introspective evaluation of confidence; it enforces highest-probability token selection (in greedy sampling). The paper you linked demonstrates that if a separate evaluation phase is allowed, a model can decide with some accuracy whether its previous statements were true. This is not the behavior we want out of a system, as it involves (1) production of potentially misleading output, (2) identification of factual statements within that output, (3) classification of each of those statements, and (4) restatement of the original output without factual errors. The research area I am advocating for would aim to prevent 1, not mask it.
  • blendaddict 7 days ago |
    But aren’t other LLMs just going to scrape the dataset and pre-learn the answers?
  • s5ma6n 7 days ago |
    I am puzzled why they have "asked the model" about the confidence and have not used the logprobs of the output tokens to estimate the confidence in responses.

    In my use case and tests, the model itself is not capable of giving a reliable confidence value, whereas logprobs almost always provide a better view on calibration.

    • michaelt 7 days ago |
      To measure confidence based on the logprobs of a given token, you must first know which token you're measuring - that's why a lot of benchmarks love multiple choice questions where the LLM responds with a single token.

      But of course that's not the way LLMs are normally used. And it precludes any sort of chain-of-thought reasoning.

      For some questions, like those involving calculations, letting the model talk to itself produces much better results. For example compare https://chatgpt.com/share/67238eda-6b08-8011-8d2d-a945f78e6f... to https://chatgpt.com/share/67235a98-d2c8-8011-b2bf-53c0efabea...

      • s5ma6n 7 days ago |
        To me it boils down to what is to be measured here. With logprobs we can measure both correctness and "not attempted", i.e. whether the LLM is guessing the response.

        Similar to exams where both the progress to the solution and the final outcome/value of the calculations are part of the grade.

        To have the cake and eat it too with chain-of-thought reasoning, one way is to ask for a "final answer" so the logprobs of the final response tokens can be evaluated: https://chatgpt.com/share/67239d92-b24c-800a-af8c-40da7be1f5...

        Another trick is using JSON mode to keep intermediate results and final response separate, so each can be graded accordingly.
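
        A rough sketch of the "final answer" variant (the prompt wording is illustrative, and averaging the trailing token probabilities is admittedly crude):

            import math
            from openai import OpenAI

            client = OpenAI()

            def final_answer_confidence(question, model="gpt-4o"):
                resp = client.chat.completions.create(
                    model=model, logprobs=True,
                    messages=[{"role": "user",
                               "content": question + "\nThink step by step, then end with a "
                                          "line of the form 'Final answer: <answer>'."}],
                )
                tokens = resp.choices[0].logprobs.content
                text = "".join(t.token for t in tokens)
                marker = text.rfind("Final answer:")
                # Average probability of the tokens from the "Final answer:" line onward.
                offset, tail = 0, []
                for t in tokens:
                    if marker != -1 and offset >= marker:
                        tail.append(math.exp(t.logprob))
                    offset += len(t.token)
                return text, (sum(tail) / len(tail)) if tail else None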

        • michaelt 7 days ago |
          > one way is to ask for a "final answer" so the final response token logprobs can be evaluated

          Alas, this won't work.

          Imagine I ask an LLM to continue the sentence "Summing those up: 4+6.75+6.52=17.27 litres of pure alcohol. In summary, the total amount of pure alcohol they have is: "

          The logprobs of the next token do not represent the LLM's confidence in its own answer. They represent the LLM's confidence in its ability to repeat the total from 18 words previously.

      • _jonas 4 days ago |
        Here are some benchmarks I ran that compare the precision/recall of various LLM error-detection methods, including logprobs and LLM self-evaluation / verbalized confidence:

        https://cleanlab.ai/blog/4o-claude/

        These approaches can detect errors better than random guessing, but there are other approaches that are significantly more effective in practice.

    • HappMacDonald 6 days ago |
      I wonder what would happen if token input included the logprob (or n/a for input from outside the LLM) of each token selected and the network were trained with that extra layer of information, especially during the human feedback training at the end.
  • gcanyon 7 days ago |
    About twenty-five years ago there was a web site that claimed a better Turing test was to ask a large number of relatively easy yes/no questions, with the idea that a human could answer virtually all of them correctly. Toward the end of passing that test they were collecting binary questions with their answers, e.g.

       Is the Moon made of cheese?
       Is the sky blue?
       Has there ever been a human being twenty feet tall?
    
    As far as I know it disappeared a long time ago — I’ve done lightweight research and not found it.

    It’s too bad, it seems they were onto something.

    • sebzim4500 7 days ago |
      I feel like GPT-4o would ace that. If you look at the questions in this benchmark they are stuff that 99% of people won't know, although at least they would know that they don't know the answer.
      • gcanyon 7 days ago |
        We'll find out -- I just submitted a form to HN to collect some QA pairs to try it out.
    • antiquark 7 days ago |
      Sounds like the computer game "Animal" from 1978.

      https://www.atariarchives.org/basicgames/showpage.php?page=4

  • lambda-research 7 days ago |
    Something that I always think about when I see discussions about hallucinations or "confidently wrong answers" is that humans have this issue too.

    For those on TikTok: how many times have you found yourself easily believing some random person online whose credentials you know nothing about? And this has been a problem for a long time (news, internet, etc).

    It's just interesting to me we are asking way more of AI than we can ever ask from humans.

    • mrcwinn 7 days ago |
      Mark Twain: "It's not what you don't know that gets you into trouble. It's what you know for sure that just ain't so."
    • BadHumans 7 days ago |
      It's very appropriate to me that we demand more of AI than we do of humans. The amount of harm and the scale of harm AI can do is much greater than a single person and there are no consequences to doing it. If a human kills someone, they go to jail. If an AI kills someone, say a self-driving car for example, it's just the cost of doing business.
    • tobyjsullivan 7 days ago |
      It's interesting to be sure. Context is everything, though. It's one thing for a random tiktok head or media pundit to be confidently wrong and it's another thing entirely if I have to work with someone who I cannot trust to be right.

      For LLMs to be broadly valuable, they need to be qualified for a role more like a coworker than a stranger.