OpenAI trained o1 to pick better steps in its chain of thought. (The moat is the dataset they used to do that.)
I mean, it obviously wasn't; did you read the thing we're commenting on? N.b. at this point, you have all you need to replicate it. Far shy of "outright fraud", though I'm sure there's a bailey for that motte.
The results might be perfectly reproducible, but their reputation is completely burned. This is not how you launch your company.
Even if you don't care about that: they didn't release anything of substance before the o1 launch. They didn't release usable weights, and they didn't ship a working hosted model of their own. So no, they didn't beat OpenAI to anything.
s/strong/extremely weak/ from my perspective; also, see article
> They didn't release usable weights,
Yes they did. They just weren't benchmarking the same as the initial claim.
> they didn't ship a working hosted model of their own.
Yes they did. They just weren't benchmarking the same as the initial claim.
> So no, they didn't beat OpenAI to anything.
Not sure where the idea they "beat OpenAI" is coming from, certainly not from me. I agree they did not.
> It's indisputably fraud.
This is indisputably incorrect, as I am disputing it.
Happy to talk it out, don't take my shortness as being disagreeable. In general, people handwave about the tokenizer[1] or "Claude" missing in a response[2]. I honestly expected the HN thread here to be far more insightful; instead, I'm seeing that it's "indisputable" it was fraud, based on repeating a couple of observations gooner teens made last week and then drawing vast conclusions from them. Conclusions that were obviously wrong if you looked at it as an engineer.
[1] No one can get the actual tokens out of an API. Gooner local-LLM stans were setting max tokens to some number <= 10, asking the same question of both models, and seeing answers of similar length. This is mundane and expected at such a short length, especially in English. I expected technical observers, even if they don't grok tokenization, to at least note that they weren't able to get the same responses with temperature = 0.0 (see the sketch after [2]).
[2] covered in article
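For what it's worth, the stronger version of the test from [1] looks roughly like this. A minimal sketch only: it assumes OpenAI-compatible endpoints, and the base URLs, keys, and model names are placeholders, not the real Reflection API.

    # With temperature=0.0 and a small max_tokens cap, two endpoints
    # serving the same underlying model should produce the same truncated
    # string; merely similar *lengths* at max_tokens <= 10 prove nothing.
    from openai import OpenAI

    def probe(base_url: str, api_key: str, model: str, prompt: str) -> str:
        client = OpenAI(base_url=base_url, api_key=api_key)
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=10,    # the cap the Twitter tests used
            temperature=0.0,  # compare exact strings, not lengths
        )
        return resp.choices[0].message.content

    q = "Explain overfitting in one sentence."
    a = probe("https://suspect.example/v1", "KEY_A", "suspect-model", q)
    b = probe("https://known.example/v1", "KEY_B", "reference-model", q)
    print("identical truncated output:", a == b)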
But what's great about the passage of time is that people can actually take what's presented in the article and try to replicate the benchmarks. Now that it's Friday, October 4th, we've got this gem:
https://x.com/mattshumer_/status/1842313328166907995
So frankly it's still fraud, because even after the postmortem the results are not reproducible. That's the whole point, right? That it does what it says on the tin. And it doesn't. This whole process could be a shell script that downloads and runs, and they've had more time than they should need. It's gone from a shell game to plain old academic dishonesty. If this were a published paper, it would be ripe for retraction.
Can't upload the exact weights he had on his computer. The guy runs an AI hosting/inference/training company; he can't upload weights he has!
The original benchmark harness wasn't shared, and the code that eventually was had a bug that conveniently boosted the model's results.
The API somehow mysteriously censors the model name, and the tokenizer is an exact match to Claude's.
Nothing would stop him from uploading all the weights, I suppose...
Deleting files is very much a thing
In most established verticals, such a cartoonish scam would be dead on arrival. But apparently generative AI is still not mature enough to make a clean break from this kind of garbage.
This isn't said aggressively or to label you, but rather to provide some context that it's probably not nearly as simple as you are suggesting: this thread looks like a bunch of confused engineers linking drama threads from laymen on Twitter/Reddit to each other, seeing pitchforks, and getting out their own. Meanwhile, the harsh conclusions they jump to are belied by A) having engineering knowledge _and_ looking into their claims, and B) reading TFA.
That's because there is nothing better today, and nothing like it in history.
Note that there was additional proof beyond the missing string "Claude": matching the maximum number of tokens the model was able to produce. This is more technical, but ChatGPT, Claude, and Llama all have different tokenizers, so words are broken up into different pieces. The API consistently did NOT match the base model's tokenizer (Llama's), instead producing the same number of tokens as Claude.
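If it helps, here's the mechanism in miniature; a sketch assuming transformers and tiktoken are installed. Claude 3's tokenizer isn't public, so GPT-4's stands in as "some other tokenizer" purely for illustration, and the Llama repo id below is a gated-access assumption:

    # Different tokenizers split the same text into different numbers of
    # pieces, so where an API truncates at a fixed max_tokens hints at
    # which tokenizer (and hence which model family) is behind it.
    from transformers import AutoTokenizer
    import tiktoken

    text = "Interpretability research disambiguates tokenization artifacts."

    llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
    gpt4 = tiktoken.encoding_for_model("gpt-4")

    print("llama tokens:", len(llama.encode(text)))
    print("gpt-4 tokens:", len(gpt4.encode(text)))
    # If a "Llama finetune" API consistently truncates at counts matching
    # a different tokenizer, that's strong evidence of a different backend.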
Companies and individuals should probably avoid GlaiveAI and Matt Shumer lest they get scammed too.
And as for the fraud part... it was an open-source model release that did not meet the claimed benchmarks when people tried to replicate it.
The model being open-source doesn't mean that what they got away with, or tried to, isn't fraud.
As for corruption, I don't believe the excuse that "yes, file corruption happens." These are model weights. If this was trained in real life, it was done on serious hardware with error-correcting disks; they weren't storing the checkpoints on microSD cards. It's certainly possible there was really unfortunate luck and there was corruption, but I don't find that excuse plausible. Especially when this is your business (and your launch!).
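Not to mention that silent corruption is exactly what checksums are for. A minimal sketch, assuming a hypothetical checksums.txt of "<sha256> <filename>" lines recorded at training time; nothing like this was published:

    # If shard hashes had been recorded at training time, "the weights
    # got corrupted" would be verifiable in seconds.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    for line in Path("checksums.txt").read_text().splitlines():
        expected, name = line.split()
        status = "OK" if sha256_of(Path(name)) == expected else "CORRUPT"
        print(name, status)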
* wrongful or criminal deception intended to result in financial or personal gain.
* a person or thing intended to deceive others, typically by unjustifiably claiming or being credited with accomplishments or qualities.
Since they were advertising GlaiveAI as this magical source of data with which they trained a model that performed better than Claude and ChatGPT, I think this firmly falls into that camp! Your definitions may differ from mine.
If you don't have a Twitter account: https://threadreaderapp.com/thread/1832933768031588622.html
Didn't these "few tokenizer-related tests" prove the API was using Claude's tokenizer instead of Llama's, based on how words were being divided into tokens?
That's a hard one to explain (it doesn't appear they're even trying to).
Reflection 70B, the top open-source model https://news.ycombinator.com/item?id=41459781
Confirmed: Reflection 70B's official API is a wrapper for Sonnet 3.5 https://news.ycombinator.com/item?id=41484981
Yes they could've switched it themselves too.
I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else. I am not going to mention another similar category of technology in this regard, just to stay "politically correct" for this site.
I hope so too - it's been four years since GPT-3 came out and I haven't found a single serious time-saving application for the technology.
If someone doesn't start making money with LLMs soon, then it will only be the scammers who benefit!
I did a standard non-middleware lm_eval_harness run and got 0.3214 on gpqa_main_zeroshot WITH the system prompt and 0.3616 without it.
I haven't run it with the middleware yet, which is supposed to do the subtraction. Now, if that adds 20% to the score, that would be a huge deal, but it would also roughly match the jump from GPT-4o to o1-preview that they got on gpqa_diamond.
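If anyone wants to repeat the non-middleware run, it's roughly this via the harness's Python API. A sketch only, assuming lm-evaluation-harness >= 0.4: the HF repo id and system prompt are placeholders, and system_instruction support depends on the harness version.

    # Rough sketch of the non-middleware gpqa_main_zeroshot run above.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=<reflection-70b-repo>,dtype=bfloat16",
        tasks=["gpqa_main_zeroshot"],
        # system_instruction="<the published Reflection system prompt>",
    )
    print(results["results"]["gpqa_main_zeroshot"])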