(If you are on Windows, there is usually a win-hip binary of llama.cpp in the project's releases, or, if things totally refuse to work, you can use the Vulkan build as a (less performant) fallback.)
Having more options can't hurt, but ROCm 5.4.2 is almost 2 years old and things have come a long way since then, so I'm curious why this is being published fresh today, in October 2024.
BTW, I recently went through and updated my compatibility doc (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has changed just in the past few months (upstream bitsandbytes, upstream xformers, and Triton-based Flash Attention): https://llm-tracker.info/howto/AMD-GPUs
tl;dr: uses the latest ROCm 6.2 to run full-precision inference for Llama 405B on a single node with 8x AMD MI300X GPUs
How mature do you think the ROCm 6.2 AMD stack is compared to Nvidia's?
llama.cpp has a Docker image but no examples of how to run it: https://github.com/ggerganov/llama.cpp/blob/master/docs/dock...
koboldcpp has a Docker image but no examples of how to run it: https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#do...
The Docker image was broken for me on a 7800 XT running RHEL 9: https://github.com/Atinoda/text-generation-webui-docker
"Docker-based" reads, to me, as if you were doing Inference on AMD cards with Docker somehow, which doesn't make sense.
Like anything AI and AMD, you need the right card(s) and rocm version along with sheer dumb luck to get it working. AMD has Docker images with rocm support, so you could merge your app in with that as the base layer. Just pass through the GPU to the container and you should get it working.
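If it helps, a minimal sketch of what that passthrough looks like (the image tag and model path here are just placeholders, not taken from this project):

# /dev/kfd is the ROCm compute interface and /dev/dri exposes the GPUs to the container
docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -v /path/to/models:/models \
  rocm/pytorch:latest

From inside the container, rocminfo should list the card; your app can then be layered on top of that base image.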
It might just be the software in a Docker image, but it removes a variable I would otherwise have to worry about during deployment. It literally is inference on AMD with Docker, if that's what you meant.
https://github.com/slashml/amd_inference/blob/main/run-docke...
First sentence of the README in the repo. Was it somehow unclear?
sudo apt install rocm
Summary:
Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
Download size: 2,369 MB / 2,371 MB
Space needed: 35.7 GB / 822 GB available
I don't understand how 36 GB can be justified for what amounts to a GPU driver. By comparison, if you download the CUDA toolkit as a single file, you get a file that's over 4 GB, so quite a bit larger than the download size you quoted. I haven't checked how much that expands to (the ROCm install seems to have a lot of redundancy, given how well it compresses), but the point is, you get something that seems insanely large either way.
Do you know what is included in ROCm that could be so big? Does it include training datasets or something?
-- matrix multiply 2048x2048 for Navi 31,
-- same for Navi 32,
-- same for Navi 33,
-- same for Navi 21,
-- same for Navi 22,
-- same for Navi 23,
-- same for Navi 24, etc.
-- matrix multiply 4096x4096 for Navi 31,
-- ...
We're working on a few different strategies to reduce the binary size. It will get worse before it gets better, but I think you can expect significant improvements in the future. There are lots of ways to slim the libraries down.
4.8G hipblaslt
1.6G libdevice_conv_operations.a
2.0G libdevice_gemm_operations.a
1.4G libMIOpen.so.1.0.60200
1.1G librocblas.so.4.2.60200
1.6G librocsolver.so.0.2.60200
1.4G librocsparse.so.1.0.60200
1.5G llvm
3.5G rocblas
2.0G rocfft
The biggest one, just to pick on one, is hipblaslt, "a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics." https://github.com/ROCm/hipBLASLt
These are mostly GPU kernels that by themselves aren't so big, but there is one for every single operation x every single supported graphics architecture, e.g.:
304K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.co
24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.dat
240K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.co
20K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.dat
344K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.co
24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.dat
Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?
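For what it's worth, a rough sketch of building and launching it (the tag name is arbitrary and the exact run flags vary by setup):

DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  -v /path/to/models:/models vllm-rocm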
There are also a few tricks/updates I'd like to try which may improve performance, e.g. hipBLASLt support being added in the next ROCm release - of course these are "maybes".
To give you a rough idea of practical performance, default SDXL with xformers is around 4.5-5 it/s (between a 3090 and a 4090, from my understanding), and exllamav2 with Qwen 72B at 3bpw is around 7 t/s (slower than a 3090, though a 3090 has to use a lower precision to fit).
As others have pointed out, I can't really see what this project offers for AMD users over existing options like llama.cpp, exllamav2, mlc-ai, etc. Most projects work relatively easily these days.
> # Other AMD-specific optimizations can be added here
> # For example, you might want to set specific flags or use AMD-optimized libraries
What are we doing here, then?
BTW, if you just want to play with a local LLM, you can try my old port of Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... Unlike CUDA or ROCm, my port is based on the Direct3D 11 GPU API and runs on all GPUs regardless of brand.
You just need to update the version check here
https://github.com/slashml/amd_inference/blob/4b9ec069c4b2ac...
Feel free to open an issue with the requirements and we will test it.
Your APU should be similar, just faster.
There are some magic environment variables you want to set to get ROCm to work with this technically unsupported APU: HSA_OVERRIDE_GFX_VERSION=9.0.0 HSA_ENABLE_SDMA=0
Performance is not great, but slightly better than running inference on the CPU, with the bonus that your CPU is essentially free for other tasks even while running LLMs.
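In practice that just means prefixing whatever you launch with the overrides, e.g. with llama.cpp (the binary path and model name here are only placeholders):

# the overrides apply only to this process, so nothing else on the system is affected
HSA_OVERRIDE_GFX_VERSION=9.0.0 HSA_ENABLE_SDMA=0 \
  build/bin/llama-cli -ngl 32 -m model.gguf -p "Once upon a time"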
This library is 50% print statements. And where it does branch, it doesn't even need to.
Defines two environment variables and sets two flags on torch.
Expectation management is a huge part of any team/organization, I think
I wouldn't disparage an entire field for lack of a clear definition in the buzzwords people use to refer to it.
mdaniel is absolutely correct. They are not software engineers.
I'm not vocal about the naïve stuff, poor design, sloppy formatting, bad English. I am vocal about projects that have no place in the ecosystem.
# install dependencies
sudo apt -y update
sudo apt -y upgrade
sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
# ensure you have permissions by adding yourself to the video and render groups
sudo usermod -aG video,render $USER
# log out and then log back in to apply the group changes
# you can run `rocminfo` and look for your GPU in the output to check everything is working thus far
# download a model, build llama.cpp, and run it
wget https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b3267
HIPCXX=clang-17 cmake -H. -Bbuild -DGGML_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" -DCMAKE_BUILD_TYPE=Release
make -j16 -C build
build/bin/llama-cli -ngl 32 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --prompt "Once upon a time"
I'd suggest RDNA 3, MI200 and MI300 users should probably use the AMD-provided ROCm packages for improved performance. Users that need PyTorch should also use the AMD-provided ROCm packages, as PyTorch has some dependencies that are not available from the system packages. Still, you can't beat the ease of installation or the compatibility with older hardware provided by the OS packages.
¹ https://lists.debian.org/debian-ai/2024/07/msg00002.html
² Not including MI300 because that released too close to the Ubuntu 24.04 launch.
³ Pre-Vega architectures might work, but have known bugs for some applications.
⁴ Vega and RDNA 2 APUs might work with Linux 6.10+ installed. I'm in the process of testing that.
⁵ The version of rocBLAS that comes with Ubuntu 24.04 is a bit old and therefore lacks some optimizations for RDNA 3. It's also missing some MI200 optimizations.
$3,600 - 61 TFLOPS - AMD Radeon Pro W7900
$4,200 - 38.7 TFLOPS - Nvidia RTX A6000 48GB Ampere
$7,200 - 91.1 TFLOPS - Nvidia RTX A6000 48GB Ada
Those old gfx906 or gfx908 cards are more competitive for fp64 than for low-precision AI workloads, but they have the memory and the price is right. I'm not sure I would recommend the hacker approach to the average user, but it is what I've done for some of the continuous integration servers I host for the Debian project.
hardware.graphics.enable = true;
services.ollama = {
  enable = true;
  acceleration = "rocm";
  environmentVariables = {
    ROC_ENABLE_PRE_VEGA = "1";
    HSA_OVERRIDE_GFX_VERSION = "11.0.0";
  };
};
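After a rebuild, a quick sanity check could look something like this (the model name is just an example):

sudo nixos-rebuild switch
ollama run llama3.1 "Say hello"
# check the service logs to confirm ROCm acceleration was picked up
journalctl -u ollama | grep -i rocm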
https://github.com/slashml/amd_inference/blob/main/Dockerfil...
Also looks like the Docker image provided by this project doesn't successfully build: https://github.com/slashml/amd_inference/issues/2