They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights and the same architecture, and just using tensor-parallel operations to perform the forward pass, that would imply no loss in quality.
I'm sure there are trade-offs but they're not clear by just looking at the abstract.
Below that, they indicate that a key part of the implementation is loading weights from disk before they're needed using a separate thread.***
* maybe I'm missing something though, someone please triple check :)
** ttft (time to first token) and s/token (seconds per token) are both lower than any alternative in all cases.
*** "daemon thread asynchronously preloads the weights"
This comes at a very interesting time for me. I have an ancient dual-Xeon workstation with 64GB of memory that I was researching how to convert to run an LLM. To start, I can just run this with 4 instances on the same machine and see how it goes, without purchasing a better GPU. It sounds like this will allow you to run very large models with minimal quants, on Craigslist-quality devices.
If it does what they say it does (and it seems to), it will be an absolute game changer for most users.
The number of layers, and the amount of time spent in each of them, makes me think any benefit from pre-loading the layer ahead is negligible.
You really need the entire model on device to consider it performant.
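Back-of-envelope (my own assumptions, not the paper's numbers): a 70B model at 4-bit is roughly 40 GB over ~80 layers, and a fast NVMe reads ~5 GB/s.

    layer_bytes   = 40e9 / 80                     # ~0.5 GB of weights per layer
    ssd_bandwidth = 5e9                           # ~5 GB/s NVMe read
    load_time     = layer_bytes / ssd_bandwidth   # ~0.1 s to fetch one layer from disk
    compute_time  = 1e-3                          # generous per-layer compute for one token on a GPU
    print(load_time / compute_time)               # ~100x: one layer of compute hides only ~1% of the load

So prefetching one layer ahead only helps if compute per layer is within an order of magnitude of the load time, which it usually isn't for single-token decoding.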
For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.
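Toy illustration (mine, with made-up names): in an MoE layer the router picks a few experts per token, so only those experts' weight matrices are actually touched and the rest could stay on disk.

    import numpy as np

    def moe_layer(x, router_w, experts, k=2):
        scores = router_w @ x                          # router scores each expert for this token
        top = np.argsort(scores)[-k:]                  # indices of the k selected experts
        gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
        # only these k experts' weights are read; the others are never needed for this token
        return sum(g * (experts[i] @ x) for g, i in zip(gates, top))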
The actual goal of the article is to highlight that we can optimise overall speed by decreasing link latency. Yes, link latency, because it's not 1 machine but several low-end devices that are used together to serve the 70B LLM.
In other words, if I want 100 tokens of output, do I have to wait 2990 seconds? If so, the terminology seems unnecessarily confusing.
https://huggingface.co/bartowski/Qwen2.5-72B-Instruct-GGUF
Man, I need to test the Q8 version with llamafile's optimizations. It would be so nice to host it locally with the new Ryzens; it could maybe fit in my 96GB of RAM.
Edit: Arch has ollama in official repos too. OpenSUSE has https://software.opensuse.org/package/ollama .
They propose that right now computation and link latency dominate the costs for multi-node inference, and pick a network topology (star) that is suited to that.
That said, it's 26-29 seconds per token for Llama 2 70B with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.
I think the paper makes the case that you could probably recruit, say, 30 graphics workstations to do much faster inference without saturating your LAN bandwidth, though.
Upshot - interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects; this may point a way forward on the interconnect-bandwidth part of the story.
I think this lays some groundwork for running a 400B model on a 3090/4090 or even smaller GPU. If you can get a huge model like that running on a single GPU, even if the mean time per token is in the seconds, that's acceptable for many use cases.
If this same technique can be used to extend context windows in addition to token autocomplete, that would be great in its own right.
Hopefully work like this continues, as throwing a ton of VRAM at a model should be regarded as a performance optimization, not necessarily a requirement.
Of course, an NVIDIA 4090 has a memory bandwidth of about 1,000 GB per second; a CPU like the i7-13700K has a memory bandwidth of 90 GB per second; and a high-end NVMe SSD might only have a read bandwidth of 10 GB per second.
So in approximate terms, an LLM and quantisation level that can produce 10 tokens per second on a 4090 will produce 1 token per second from RAM and a token every 10 seconds from SSD.
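Same estimate in numbers (my arithmetic, using a round ~100 GB of weights read per generated token for illustration):

    model_bytes = 100e9                               # a dense model reads roughly all weights once per token
    for name, bw in [("4090 VRAM", 1000e9), ("CPU RAM", 90e9), ("NVMe SSD", 10e9)]:
        print(name, bw / model_bytes, "tokens/s")     # ~10, ~0.9, ~0.1 tokens/s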
It would be very meta to use AI to observe these access patterns and distribute the model accordingly based on usage, so that it can optimize placement for your given context domain.
Current? Apple recently published a neat paper on how they optimise for both inference (cpu/gpu) and memory use:
Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with up to 4x and 20x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively.
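A very rough sketch of how I read the “windowing” idea (my own simplification, not Apple's code): keep the neurons activated by the last few tokens resident in DRAM and fetch from flash only the ones newly activated by the current token.

    from collections import deque

    class NeuronWindowCache:
        def __init__(self, window=5):
            self.recent = deque(maxlen=window)   # per-token sets of active neuron indices
            self.in_dram = {}                    # neuron index -> weight row kept in DRAM

        def rows_for(self, active, read_row):    # read_row: assumed callback that reads one row from flash
            cached = set().union(*self.recent) if self.recent else set()
            for i in set(active) - cached:       # only newly activated neurons hit flash
                self.in_dram[i] = read_row(i)
            self.recent.append(set(active))
            keep = set().union(*self.recent)     # evict rows that fell out of the window
            self.in_dram = {i: r for i, r in self.in_dram.items() if i in keep}
            return [self.in_dram[i] for i in active]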
https://news.ycombinator.com/item?id=38704982
I mean, you COULD run it before as well, even if you don't have enough RAM or VRAM, by using something like `zram`. It'd probably be even slower (and borderline usable, depending on the use case), but it's not impossible to get things to run.
Could be a big deal if it allows cluster of smaller GPUs to compete with a single large VRAM GPU.
Unfortunately I’m a few months out of date - which is an eternity in LLM inference techniques - so I’m not sure what the current state of distributed inference looks like.
it essentially just copies a chunk of the model to each one, which works well for situations where each machine has limited VRAM
you run that on the remote nodes
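Roughly how that kind of split gets assigned (illustrative only, not the tool's actual code): each node gets a contiguous slice of the layers and forwards its activations to the next node in the chain.

    def assign_layers(n_layers, nodes):
        per = -(-n_layers // len(nodes))          # ceiling division
        return {node: list(range(i * per, min((i + 1) * per, n_layers)))
                for i, node in enumerate(nodes)}

    print(assign_layers(80, ["node-a", "node-b", "node-c"]))
    # node-a gets layers 0-26, node-b gets 27-53, node-c gets 54-79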
This looks like some potentially promising research, and I'm looking into reproducing it now. We want to lower the barrier to running large models as much as possible, so if this works, it would be a potential addition to the exo offering.
It is also possible some of these optimizations could help optimize distribution based on latency and bandwidth between nodes.
Most of the rewards will be reaped by consumers rather than providers.
We're also in an age where the amount of RAM in consumer devices was settled on almost entirely before LLMs existed. I find it highly likely vendors will optimize for higher RAM capacity over other priorities in future hardware.
How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.
It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will grow along with decreasing training costs and hardware improvements. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.
It's a highly competitive market. Companies aren't going to pay $100k/year to run a model on something that can run on a $2k consumer-grade device.
128GB of GPU-accessible, fast RAM can be had for $5,000 on a MacBook Pro today. What will it be 3-4 years from now on Linux/Windows machines?
And we haven't seen any SoC providers try to optimize for RAM capacity over compute yet.
That's what's happened with power efficiency and ARM CPUs, after all!
Not to speak of managed cloud services that run on ARM under-the-hood/behind the scenes.
Of course, ARM isn't inherently cheaper; AMD and Intel could cut prices/margins significantly and probably be competitive on $/perf.
LocalLLaMA, open weights, and open datasets have really helped show that this can be done if you have enough resources and motivation.