Also on hugging face https://huggingface.co/microsoft/phi-4
Edit: so, for example, if you want the unsloth "debugged" version of Phi-4, you would run:
`$ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`
(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)
[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes
For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well.
In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".
Unfortunately I'm only getting 6 tok/s on an NVIDIA A4000, so it's still not great for real-time queries, but luckily now that it's MIT licensed it's available on OpenRouter [2] for a great price of $0.07/$0.14 per million tokens at a fast 78 tok/s.
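For anyone wanting to try it there: OpenRouter exposes an OpenAI-compatible chat completions endpoint. A minimal sketch in Python using only the standard library (the `microsoft/phi-4` model slug is my assumption, check OpenRouter's model list for the exact id):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, prompt: str,
                  model: str = "microsoft/phi-4") -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for OpenRouter.
    The model slug is assumed, not confirmed by the thread."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

# To actually call it:
#   with urllib.request.urlopen(build_request(key, "hello")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

At $0.07/$0.14 per million tokens, even heavy batch workloads stay cheap relative to the frontier models.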
Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].
[1] https://pvq.app/leaderboard
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost; unfortunately it gave good grades to too many poor answers [1]. If we had to pick a new grading model now, hopefully the much improved Gemini Flash 2.0 would yield better results.
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...
The only judges that matter at this stage are humans. Maybe someday when we have models that humans agree are reliably good you could use them to judge lesser-but-cheaper models.
It should be noted that Mixtral 8x7B didn't grade its own model very highly (11th); its standout was grading Microsoft's WizardLM2 model highly at #2. That's not entirely without merit, as at the time of release WizardLM2 was Microsoft's most advanced model and the best open-source LLM available [1]. We also found it generated great, high-quality answers, so I'm surprised it's not more used: it's only OpenRouter's 15th most used model this month [2], although it's received very little marketing, essentially just an announcement blog post.
Whilst nothing is perfect, we're happy with the grading system as it's still able to identify good answers from bad ones, good models from bad ones, and which topics models perform poorly on. Some of the grades are surprising, since we have preconceptions about where models should rank before the results come in. That's also why it's important to have multiple independent benchmarks, especially benchmarks that LLMs aren't optimized for, as I've often been disappointed by how some models perform in practice versus how well they score on benchmarks.
Either way you can inspect the different answers from the different models yourself by paging through the popular questions [3]:
[1] https://wizardlm.github.io/WizardLM2/
The one red flag with Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], so it's one area especially worth keeping an eye on for those testing Phi-4 themselves...
[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...
[2] https://github.com/google-research/google-research/blob/mast...
> I think the strategy is now offer cheap and performant infra to run the models.
Is this not what Microsoft is doing? What can Microsoft possibly lose by releasing a model?
That is the reason they are making products so that people stay on the platform.
This means that in a world where AWS/Azure/GCP all compete in the compute and the models themselves are commodities, AI isn't a product, it's a feature of every product. In that world, what is OpenAI doing besides being an unnecessary middleman to Azure?
I'd agree there isn't much money in it. OpenAI should probably milk the revenue they get now and make hay while the sun is shining. But their apparent strategy is to bet it all on finding another breakthrough similar to the switch from text completion to a chat interface.
OpenAI isn't losing yet because their models are still marginally better and they have a lot of inertia, but their API isn't going to save them.
> But their apparent strategy is to bet it all on finding another breakthrough similar to the switch from text completion to a chat interface
I'm still convinced that their strategy is to find an exit ASAP and let Altman cash out. He's playing up AGI because it's the only possible way that "AI" becomes a product in its own right so investors need to hear that that's the goal, but I think he knows full well it's not in reach and he can only keep the con going so long. An exit is the most profitable way out for him.
I don't really think they do, because to me it has seemed, pretty much since GPT-1, that having callbacks to run Python and query Google, having an "inner dialog" before summarizing an answer, and a dozen more simple improvements like this are quite obvious things to do that nobody had actually implemented (yet). And even if some of them are not obvious per se, they are pretty obvious in hindsight. But, yeah, it's debatable.
I must admit, though, that I doubt this obvious weakness is not obvious to the stakeholders. I have no idea what the plan is; maybe what they're going to have that Anthropic doesn't is a nuclear reactor. Honestly, we're all pretending to be forward-thinking analysts here, but in reality I couldn't figure out that Musk's "investment" into Twitter was literally politics at the time it was happening. Even though I was sure there was some plan, I couldn't say what it was, and I don't remember anybody in these threads expressing clearly what is now quite obvious in hindsight. Neither did all the people like Matt Levine, who are actually paid for their shitposting: I mostly remember them making fun of Musk "doing stupid stuff and finding out" and calling it a "toy".
I could maybe switch to a different provider with cheaper pricing or better models, but that doesn't mean OpenAI doesn't offer a "product".
Their unique value-adds are the ChatGPT brand, being the "default destination" when people want AI, as well as all the "extra features" they add on top of raw LLMs: the ability to do internet searches, recall facts about you from previous conversations, present data in a nice, interactive way by writing a React app, call down to Python or Wolfram Alpha for arithmetic, etc.
I wouldn't be surprised if they eventually stop developing their own models and start using the best ones available at any given time.
The default destination for many is still just Google, and they've added AI to their searches. AI chat boxes are shoehorned into a ton of applications, and at the end of the day people will go to the most accessible one. This is why getting AI into Windows, into your web browser, or onto your phone is a huge goal.
As far as extra features go, ChatGPT is a good default, but they're severely lacking compared to most other solutions out there.
For these models probably no. But for proprietary things that are mission critical and purpose-built (think Adobe Creative Suite) the calculus is very different.
MS, Google, and Amazon all win from providing infra for open-source models. I have no idea what game Meta is playing.
Based on their business moves in recent history, I’d guess most of them are playing Farmville.
Whether it be Facebook, Instagram, Threads, Messenger, WhatsApp, etc., their focus is to acquire users, keep them on their platforms, and own their content, because /human attention is fundamentally valuable/.
Meta owns 40% of the most popular social media platforms today, but their attention economies face great threats: YouTube, TikTok, Telegram, WeChat, and many more threaten to unseat them every year.
Most importantly, the quality of content on these platforms greatly influences their popularity. If Meta can accelerate AI development in all its forms, then content quality across all apps/platforms can be equalized: video on YouTube or TikTok will be no higher quality than on Facebook or Instagram, and messages on Threads will be no more engaging than those on Twitter. Their recent experiments with AI-generated profiles [0] signal this is the case.
Once content quality, and luring creators to your platform, are neutralized as business challenges that affect how effectively lurking end users can be retained, it becomes easier for Meta to retain any user who enters their platforms and to gain an effective attention monopoly, without needing to keep buying apps that could otherwise supplant theirs.
And so, it is in their benefit to give away their models 'for free', 'speed up' the industry's development efforts in general, de-risk other companies surpassing their efforts, etc.
[0] https://thebaynet.com/meta-faces-backlash-over-ai-generated-...
Meta makes money from ads. To make more money, they either need to capture more of their users' time and show more ads, or show better ads that users click more often. Meta is betting on AI models making it easier to do both.
Better generative AI means you can make more ads faster, which means there are more ad variants to a/b test across, which means it's easier to find an ad that users will click.
To make users stay on their platforms, Meta figures out what content will keep them there, and then shows them that content. Before gen AI, they were only able to show existing content from real users, but sometimes the "ideal" thing for you hasn't been created yet. They bet on the fact that they'll be able to use AI to create hyper-personalized content for their users that engages them better than human-made content.
Only having one model host (Hugging Face) is bad for obvious reasons (and good in others, yes, but still).
Ollama offering an alternative as a model host seems quite reasonable and quite well implemented.
The frontend really is nothing; it's just llama.cpp in a Go wrapper. It has no value and it's not really interesting; it's simple, stable technology that is perfectly fine to rely on and be totally unexcited by and uninterested in, technically.
…but, they do a lot more than that; and I think it’s a little unfair to imply that trivial piece of their stack is all they do.
Whilst it's now a UX friendly front-end for llama.cpp, it's also working on adding support for other backends like MLX [1].
Now, applications like Ollama obviously need to exist, as not everyone can run CLI utilities, let alone clone a git repo and compile it themselves. Easy-to-use GUIs are essential for the adoption of new tech (much like how there are many apps that wrap ffmpeg and are mostly UI).
However, if Ollama is mostly doing commodity GUI things over a fully fleshed-out, _unique_ codebase to which its very existence is owed, they should do everything in their power to point that out. I'm sure they're legally within their rights because of the licensing, but just from an ethical perspective.
I think there is a lot of ill-will towards ollama in some hard-core OG LLM communities because ollama appears to be attempting to capture the value that ggerganov has provided to the world in this tool without adequate attribution (although there is a small footnote, iirc). Basically, the debt that ollama owes to llama.cpp is so immense that they need to do a much better job recognizing it imo.
I use them because they run as a systemd service with a convenient HTTP API. That's been extremely helpful for switching between GUIs.
I also like their model organization scheme and the Modelfile paradigm. It's also really handy that it loads and unloads models when called, which is useful for experimentation and some complex workflows, e.g. embedding followed by inference.
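That HTTP API workflow can be sketched in a few lines (assuming Ollama's default endpoint at `localhost:11434` and a model you've already pulled; `phi4` below is just an example name):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address

def build_payload(model: str, prompt: str) -> bytes:
    """Body for Ollama's /api/generate endpoint; stream=False asks for a
    single JSON object instead of a stream of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST to /api/generate. Ollama loads the model on demand and unloads
    it after an idle timeout, which is what makes model switching cheap."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# e.g. generate("phi4", "Why is the sky blue?")
```

Any GUI that speaks this API can sit in front of the same running service, which is what makes switching frontends painless.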
Is llama.cpp doing 100% of the "heavy lifting"? Sure, but some light lifting is also needed to lower the activation threshold and bring the value to life.
I would not use llama.cpp, it's simply too cumbersome.
If Ollama did not exist, I would have to invent it.
Is it not "innovative"? Who cares! I want it. Commodity GUI? Again, I don't think they have a GUI at all. Are you maybe thinking of OpenWebUI?
`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
Phrased differently, when a task has many valid and correct conclusions, this technique allows the LLM to see "How did I do similar tasks before?" and it'll tend to solve new tasks by making similar decisions it made for previous similar tasks.
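A minimal sketch of that "retrieve similar past tasks as few-shots" idea (the token-overlap similarity here is an illustrative stand-in; a real pipeline would use embeddings, and none of these names come from the post):

```python
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Crude token-overlap similarity; a real system would use embeddings."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())
    return overlap / max(1, max(sum(ta.values()), sum(tb.values())))

def select_few_shots(task: str, history: list, k: int = 3) -> list:
    """Pick the k historical (input, decision) pairs most similar to the
    new task, so the model tends to repeat its earlier decisions."""
    return sorted(history, key=lambda ex: similarity(task, ex[0]),
                  reverse=True)[:k]

def build_messages(task: str, history: list, k: int = 3) -> list:
    """Assemble a chat transcript with the retrieved examples as few-shots."""
    messages = [{"role": "system",
                 "content": "Resolve the task consistently with prior decisions."}]
    for inp, decision in select_few_shots(task, history, k):
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": decision})
    messages.append({"role": "user", "content": task})
    return messages
```

Because the few-shots are drawn from the model's own prior outputs on similar inputs, the smaller model inherits the larger model's decision boundaries for that neighborhood of tasks.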
Two things to note:
- You'll typically still want to have some small epsilon where you choose to run the task without few-shots. This will help prevent mistakes from propagating forward indefinitely.
- You can have humans correct historical examples, and use their feedback to improve the large model dynamically in real-time. This is basically FSKD where the human is the "large model" and the large foundation model is the "small model".
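Those two notes might be sketched like this (the function names and the 5% epsilon are illustrative choices, not from the post):

```python
import random

def pick_examples(history: list, epsilon: float = 0.05, rng=random) -> list:
    """With probability epsilon, run the task zero-shot so one early bad
    decision can't keep re-selecting itself as a few-shot forever."""
    if rng.random() < epsilon:
        return []
    return history

def apply_human_correction(history: list, index: int, corrected: str) -> None:
    """Overwrite a stored decision with the human's fix; the next retrieval
    immediately serves the corrected example (the human playing the 'large
    model' role in the FSKD framing)."""
    inp, _ = history[index]
    history[index] = (inp, corrected)
```

The epsilon exploration trades a small amount of short-term consistency for protection against compounding errors, the same intuition as epsilon-greedy exploration in bandits.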
1. The only absolute quality metric I saw in that blog post, afaict, was expert agreement... at 90%. All of our customers would fire us at that level across all of the different B2B domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. Gpt-4o-mini is great. For the kind of simple tasks you describe, we find we can get gpt-4o-mini to achieve about 95-98% agreement with gpt-4o by iteratively and manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and I keep hopefully revisiting DSPy et al. For now, they fall short of standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic-search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that we'll reliably do it correctly a million times out of a million). We chose a particularly challenging task for the case study, though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.
Then a quick search revealed you can, as of a few weeks ago.