• loufe a day ago |
    It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.
    • not_a_dane a day ago |
      Nothing is free in this world.
    • zackangelo a day ago |
      I've only read the abstract but they don't mention quantizing the weights or otherwise trying to shrink the model in any way.

      They're claiming to be able to efficiently run larger models without loading the entire thing into GPU memory. If they're using the same weights, the same architecture and just using tensor parallel operations to perform the forward pass that would imply no loss in quality.

      I'm sure there are trade-offs but they're not clear by just looking at the abstract.

      • tgtweak a day ago |
        I read it like this too - no change to the weights or model quality, just optimizing the lower boundaries of performance when you are splitting from VRAM to RAM to disk (or network).
    • freehorse a day ago |
      From what I get from skimming the article, the main cost is speed of token generation (token latency). You can always run a large model by reading weights directly from disk and not care much about RAM, but it is very slow. They try to improve that aspect with some optimisations, but it is still definitely slower than using RAM or VRAM.
      • refulgentis a day ago |
        Table 3 directly refutes this* and claims 0 tradeoffs.**

        Below that, they indicate that a key part of the implementation is loading weights from disk before they're needed using a separate thread.***

        * maybe I'm missing something though, someone please triple check :)

        ** ttft (time to first token) and s/token (seconds per token) are both lower than any alternative in all cases.

        *** "daemon thread asynchronously preloads the weights"

        • sgc a day ago |
          I want to add that their chart shows s/token per device (edit: as per the heading on table 1 - it could also be confusing grammar), so it sounds like you are getting 4x the listed s/token on their 4-laptop cluster. Their laptops are not even hardwired - they are connecting over wifi.

          This comes at a very interesting time for me. I have an ancient dual-Xeon workstation with 64GB of memory that I was researching how to convert to run an LLM. To start, I can just run 4 instances on that same machine and see how it goes, without purchasing a better GPU. It sounds like this will allow you to run very large models with minimal quants on Craigslist-quality devices.

          If it does what they say it does (and it seems to), it will be an absolute game changer for most users.

    • woadwarrior01 a day ago |
      It's not cost-free. It comes at the cost of greatly increased latency: 29.9 seconds per token with Llama 3.1-70B, per Table 1 (pg 8) of the paper.
      • m3kw9 a day ago |
        Ah the disk swap method
        • thelastparadise a day ago |
          Is there any predictability/patterns for neuron/layer activation? If so, would it be reasonable to have a second tiny model that specifically tries to predict activation and preemptively swap those into memory?
          • tcdent a day ago |
            Depends on the architecture, but generally you just move through the layers linearly. Simple iteration.

            The number of layers, and the amount of time spent in each of them, makes me think any benefit from pre-loading the layer ahead is negligible.

            You really need the entire model on device to consider it performant.
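
            Back-of-envelope with made-up numbers: even with perfect one-layer-ahead prefetch, per-token time is bounded by whichever of load or compute is slower per layer, so when the disk dominates the saving is small:

              layers       = 80     # e.g. a 70B-class model (hypothetical)
              t_compute_ms = 5      # per layer, hypothetical
              t_load_ms    = 300    # per layer from SSD, hypothetical

              serial  = layers * (t_load_ms + t_compute_ms) / 1000
              overlap = layers * max(t_load_ms, t_compute_ms) / 1000
              print(serial, overlap)  # ~24.4s vs ~24.0s per token: still disk-bound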

          • miki123211 a day ago |
            This isn't how neural networks work.

            For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.

        • _ache_ a day ago |
          It's not disk swap. It's multi-device LLM inference.
          • kridsdale3 a day ago |
            That looked like an analogy. Back in the days of a mechanical arm moving magnetic fields around in our PCs, you could have the illusion of infinite RAM as long as you're ok with microsecond operations now taking two million times longer. This is akin.
          • wpietri a day ago |
            I think the point is that it has the same sort of latency tradeoff that disk swap did: it's awful, but sometimes better than nothing.
      • _ache_ a day ago |
        That is s/token and not token/s. The cost is high.

        The actual goal of the article is to highlight that we can optimise the overall speed by decreasing link latency. Yes, link latency, because it's not 1 machine but several low-end devices used together to serve the 70B LLM.

      • teraflop a day ago |
        Am I just misunderstanding, or is the paper using "latency" when what they really mean is "throughput"?

        In other words, if I want 100 tokens of output, do I have to wait 2990 seconds? If so, the terminology seems unnecessarily confusing.

  • dvh a day ago |
    So when will I be able to "sudo apt-get install llm" ?
    • jsanders9 a day ago |
      Ollama is close...
    • mysterhawk a day ago |
      You can already do it with llamafile - check out the project, it lets you convert a .gguf model into a portable executable
    • yjftsjthsd-h a day ago |
      I'm not aware of any Debian-family distro that packages it, but NixOS has at least ollama and llama-cpp in its repos. Honestly, even if the more stable distributions did have these things packaged, I would hesitate to use the packaged versions because all of this stuff is still moving so quickly that you'd be on an old version and it would hurt.

      Edit: Arch has ollama in official repos too. OpenSUSE has https://software.opensuse.org/package/ollama .

    • paxys a day ago |
      You already can with ollama
    • o11c a day ago |
      Realistically, you probably want to wait until Vulkan support trickles out. That way, you aren't at the whim of the various evil hardware drivers (everybody's suck), and the AI can give you a disappointingly confused answer much faster than running the LLM on a CPU can.
  • vessenes a day ago |
    This is not a magical memory-reduction technique, though it does manage memory with some clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

    They propose that right now computation and latency dominate the costs for multi-node inference, and pick a network topology (star) that is savvy to that.

    That said, it's 26-29 seconds per token for llama2-70b with their 8 edge devices, each using 4 gigs of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.

    I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without saturating your LAN bandwidth, though.

    Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects; this may point a way forward on lowering the interconnect-bandwidth part of the story.

    • tgtweak a day ago |
      I think the main advantage here is you COULD run it, even if it takes a while. That is a step up from the current limitation of needing enough RAM or VRAM to hold the model.

      I think this lays some groundwork for running a 400B model on a 3090/4090 or even smaller GPU. If you can get a huge model like that running on a single GPU, even if the mean time per token is in the seconds, that's acceptable for many use cases.

      If this same technique can be used to extend context windows in addition to token autocomplete, that would be great in its own right.

      Hopefully work like this continues, as throwing a ton of VRAM at a model should be regarded as a performance optimization, not necessarily a requirement.

      • michaelt a day ago |
        It's already technically possible to run huge models locally when you don't have the RAM/VRAM needed - llama.cpp can 'mmap' the model from disk.

        Of course, an Nvidia 4090 has a memory bandwidth of about 1000 GB per second; a CPU like the i7-13700K has a memory bandwidth of 90 GB per second; and a high-end NVMe SSD might only have a read bandwidth of 10 GB per second.

        So in approximate terms, an LLM and quantisation level that can produce 10 tokens per second on a 4090 will produce 1 token per second from RAM and a token every 10 seconds from SSD.
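
        The rough rule of thumb behind those ratios: for single-stream decoding you read more or less every weight once per token, so tokens/s ≈ bandwidth / model size. With illustrative numbers only:

          # memory-bandwidth-bound estimate: tokens/s ~= bandwidth / bytes read per token
          model_gb = 40   # e.g. a 70B model at ~4-bit quantisation (illustrative)
          for name, gb_per_s in [("4090 VRAM", 1000), ("DDR5 RAM", 90), ("NVMe SSD", 10)]:
              print(f"{name}: ~{gb_per_s / model_gb:.2f} tokens/s")
          # ~25, ~2.25 and ~0.25 tokens/s: the same ~10x steps as the estimate above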

        • slimsag a day ago |
          I haven't been able to get llama.cpp's mmap logic to work on macOS
        • tgtweak a day ago |
          Yeah, mmap with caching left up to the OS or storage drivers, instead of model-informed placement, is not going to yield the same results. Observing data access patterns, anticipating inference data requirements and accounting for hardware latency when distributing the model can yield some pretty significant results, as the same approach does in other domains. This is adjacent to compiler optimization.

          It would be very meta to use AI to observe these access patterns and distribute the model accordingly based on usage, so that it can optimize placement for your given context domain.

      • ignoramous a day ago |
        > That is a step up from current model limitations which require ram or vram to hold the model.

        Current? Apple recently published a neat paper on how they optimise for both inference (cpu/gpu) and memory use:

          Our method involves constructing an inference cost model that takes into account
          the characteristics of flash memory, guiding us to optimize in two critical areas:
          reducing the volume of data transferred from flash and reading data in larger, more contiguous
          chunks. Within this hardware-informed framework, we introduce two principal techniques.
          First, “windowing” strategically reduces data transfer by reusing previously activated neurons,
          and second, “row-column bundling”, tailored to the sequential data access strengths
          of flash memory, increases the size of data chunks read from flash memory. These methods
          collectively enable running models up to twice the size of the available DRAM, with
          up to 4x and 20x increase in inference speed compared to naive loading approaches in CPU
          and GPU, respectively.
        
        https://news.ycombinator.com/item?id=38704982
        • tgtweak a day ago |
          Similar, but I think the Apple approach requires model modification, whereas AFAIK the OP's solution works with the model verbatim. I could be wrong, as I haven't looked into the code, but given the specificity of the first article regarding hardware and model, I would assume so.
      • diggan a day ago |
        > I think the main advantage here is you COULD run it, even it it takes a while.

        I mean, you COULD run it before as well, even if you don't have enough RAM or VRAM, by using something like `zram`. It'd probably be even slower (and border-line usable, depending on the use case), but it's not impossible to get things to run.

        • prometheon1 17 hours ago |
          Zram compresses part of the data in ram, right? Can an LLM be compressed?
          • xena 10 hours ago |
            Not really. The weights are mostly random numbers.
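
            Quick sketch of why (illustrative only - real weights aren't literally random, but their bit patterns are similarly high-entropy, so lossless compressors gain very little):

              import zlib
              import numpy as np

              weights = np.random.randn(1_000_000).astype(np.float32)  # stand-in for a weight tensor
              raw = weights.tobytes()
              print(len(zlib.compress(raw, 9)) / len(raw))  # usually ~0.9: lossless compression barely helps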
    • alchemist1e9 a day ago |
      > I think the paper makes the case that you could probably recruit say your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

      Could be a big deal if it allows cluster of smaller GPUs to compete with a single large VRAM GPU.

      Unfortunately I'm a few months out of date - which is an eternity in LLM inference techniques - so I'm not sure what the current state of distributed inference looks like.

      • Palomides a day ago |
        llama.cpp supports splitting work across multiple nodes on a network already

        it essentially just copies a chunk of the model to each one, works well for situations where each machine has limited vram

      • vessenes a day ago |
        Yeah, I think these methods could be baked into llama.cpp or some other Python library higher up the toolchain, or what have you. They shard out each layer (ish?) to the edges and recombine that layer's output at the master node, while the edges load up their next bit if they need to; I would guess the devil is in the details for all the possible tensor types and architectures (for instance, how shall we implement skip layers?).
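
        A toy sketch of that shard-and-recombine step for a single linear layer (illustrative only, not the paper's actual partitioning or communication scheme):

          import numpy as np

          hidden, n_workers = 4096, 4
          x = np.random.randn(hidden).astype(np.float32)          # activation held at the master node
          W = np.random.randn(hidden, hidden).astype(np.float32)  # one layer's weight matrix

          # each "edge device" holds a column shard of W and computes its slice of the output
          shards   = np.array_split(W, n_workers, axis=1)
          partials = [x @ shard for shard in shards]              # would run in parallel on the edge devices

          y = np.concatenate(partials)                            # master recombines the layer's output
          assert np.allclose(y, x @ W, rtol=1e-3, atol=1e-3)      # matches the unsharded layer (up to fp32 rounding)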
    • k1musab1 a day ago |
      Do you think this allows distributed inference only, or does it open the door to distributed training as well? Democratization of models is in part hampered by the total compute a single person or small group can make use of, but if a project like folding@home, only for training large models, were possible, it could change the game somewhat.
  • Zetaphor a day ago |
    Is this different from (or related to) the work being done by the exo project?

    https://github.com/exo-explore/exo

    • tgtweak a day ago |
      Exo is for partitioning over the network across devices (implementing some bandwidth-reducing partitions) but still has a minimum RAM/VRAM requirement to load a model. This could, in theory, be combined with it to allow larger models to run on exo clusters with less GPU/RAM than the underlying model requires (at the cost of some performance, no doubt, but still).
      • alexandercheema a day ago |
        exo maintainer here. tgtweak is correct.

        This looks like potentially promising research that I'm looking into reproducing now. We want to lower the barrier to running large models as much as possible, so if this works, it would be a potential addition to the exo offering.

        • tgtweak a day ago |
          Yeah, combining these two would make a lot of sense; there is a big appetite to run larger models - even slowly - on clustered hardware. This way you can add compute to speed up the token rate, rather than adding it just to run the model at all.

          It is also possible some of these techniques could help optimize how the model is distributed based on latency and bandwidth between nodes.

  • tgtweak a day ago |
    Is there a cuda implementation of this... asking for a friend
  • adam_arthur a day ago |
    While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.

    Most of the rewards will be reaped by consumers rather than providers.

    We're also in an age where the amount of RAM in consumer devices was settled on almost entirely before LLMs existed. I find it highly likely vendors will prioritize higher RAM capacity over other things in future hardware.

    How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.

    It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will grow along with falling training costs and improving hardware. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.

    • lxgr a day ago |
      Enterprises use LLMs too, and quite often there wouldn't be any client you could reasonably run the model on. (You wouldn't want to e.g. have an LLM summarize and categorize a user request on their device, since that would require you to ship your model and/or internal knowledge base to the client.)
      • adam_arthur a day ago |
        Yes, but if you can run a sufficient LLM on a $2,000 laptop, then the cost to serve it from the cloud will be similarly cheap. (e.g. reserve an appropriately sized EC2 instance for pennies on the dollar)

        It's a highly competitive market. Companies aren't going to pay $100k/year to run a model that they could run on a $2k consumer-grade device.

        128GB of GPU-accessible/fast RAM can be had for $5000 in a MacBook Pro today. What will it be 3-4 years from now on Linux/Windows machines?

        And we still haven't seen any SoC providers try to optimize for RAM capacity over compute yet.

        • lxgr a day ago |
          Oh yes, I could definitely see the privacy-preserving consumer use case creating sufficient demand for efficiency that also bleeds over into the enterprise market.

          That's what's happened with power efficiency and ARM CPUs, after all!

          • adunsulag a day ago |
            This is where I want highly sensitive healthcare uses of LLMs to end up: note summarization, suggested diagnoses (with the provider always in control), and other augmented abilities for clinical staff, without the risk of healthcare data being sent outside the device or the very local network.
          • adam_arthur a day ago |
            Not sure what you mean: https://aws.amazon.com/ec2/graviton/

            Not to speak of managed cloud services that run on ARM under-the-hood/behind the scenes.

            Of course, ARM isn't inherently cheaper; AMD and Intel could cut prices/margins significantly and probably be competitive on $/perf.

            • lxgr 9 hours ago |
              That's what I mean: ARM was initially attractive in the low-power/low-performance market, then it scaled up to higher and higher power cores (while still staying power efficient), which in turn attracted datacenter customers.
      • simsla a day ago |
        Depends, shipping part of it (just an encoder or decoder) could still work.
        • lxgr a day ago |
          Even if bandwidth weren't an issue and all users had compatible hardware: You'd still be offloading a (semi-)trusted computation to user hardware, which is usually completely untrusted.
  • tonetegeatinst 10 hours ago |
    While training seems out of reach for the average tech user unless they have a data center for a homelab or a very large income, SOTA models can easily be run on edge devices, either a phone or a dedicated computer/server.

    LocalLLaMA and the availability of open weights and open datasets have really helped show that this can be done if you have enough resources and motivation.