Vulkan makes a huge positive impact on LLM inference with llama.cpp, especially with multimodal models.
I’m thinking about getting a Radxa Orion O6, but I could not find anything about the state of Vulkan support.
I did a llama.cpp build with Vulkan enabled, but it was super slow, even when I offloaded only one layer to it. I’m clearly ignorant of all the GPU-related technologies, so I don’t understand what Vulkan exactly is (driver, library, etc.), nor how it compares or relates to CUDA, OpenCL, etc. I don’t even know if it really used the GPU or fell back to emulation on the CPU. All these things are totally obscure to me; there are probably too many layers of abstraction and cryptic names for me :-/
Has anyone run vkpeak [ https://github.com/nihui/vkpeak ] on the O6?
Thanks!
Wow, this is actually really impressive for a Mali GPU!
fp32-scalar = 2391.64 GFLOPS
fp32-vec4 = 2592.18 GFLOPS
fp16-scalar = 2363.91 GFLOPS
fp16-vec4 = 4964.78 GFLOPS
FWIW, here are the Rock 5B (RK3588 + Mali Valhall G610) vkpeak benchmarks and vulkaninfo output:
The highlights being:
arm_release_ver: g24p0-00eac0, rk_so_ver: 8
device = Mali-G610
fp32-scalar = 467.89 GFLOPS
fp32-vec4 = 496.97 GFLOPS
fp16-scalar = 471.15 GFLOPS
fp16-vec4 = 978.09 GFLOPS
fp16-matrix = 0.00 GFLOPS
int8-dotprod = 1884.12 GIOPS
The G720 MC10 has about 5x the throughput of the G610 MC4.
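The "about 5x" figure can be checked directly from the vkpeak numbers quoted in the two posts above. This is only a small arithmetic sketch; it assumes the first set of figures is the O6's G720 and the second set is the Rock 5B's G610, as the posts indicate.

```python
# vkpeak figures copied from the two posts above (GFLOPS).
# Assumption: the first set is the O6 (G720), the second the Rock 5B (G610).
g720 = {
    "fp32-scalar": 2391.64,
    "fp32-vec4": 2592.18,
    "fp16-scalar": 2363.91,
    "fp16-vec4": 4964.78,
}
g610 = {
    "fp32-scalar": 467.89,
    "fp32-vec4": 496.97,
    "fp16-scalar": 471.15,
    "fp16-vec4": 978.09,
}

# Per-test throughput ratio of the G720 over the G610.
ratios = {name: g720[name] / g610[name] for name in g720}

for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:.2f}x")
```

All four ratios land slightly above 5, which is consistent with the claim.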
Thanks for showing the comparison! That’s really damn impressive and might bode well for LLM prompt-processing workloads!
Hi all, I was wondering if anyone has investigated the reason for the poor performance of Vulkan with llama.cpp yet?
Does it look like this might just be an issue with the Mali G720 drivers doing something very suboptimal, or is llama.cpp trying to use a kernel that the Mali G720 just cannot support properly?
How did you get llama.cpp compiled?
Locally I ran the llama.cpp benchmark on the O6 board with the Vulkan GPU backend; the performance is as below:
taskset -c 0,5,6,7,8,9,10,11 llama-bench -m Qwen2.5-3B-Instruct-Q4_0.gguf -pg 128,128 -t 8 -ngl 1000
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 1000 | 8 | pp512 | 165.78 ± 0.39 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 1000 | 8 | tg128 | 11.70 ± 0.12 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 1000 | 8 | pp128+tg128 | 21.57 ± 0.01 |
build: 14d627f4 (5288)
Before the test, I set the GPU to performance mode with the two commands below:
echo performance > /sys/class/misc/mali0/device/devfreq/15000000.gpu/governor
echo always_on > /sys/class/misc/mali0/device/power_policy
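On the earlier question of whether the Vulkan backend really ran on the GPU or fell back to the CPU: the `ggml_vulkan:` device line in the log above names the device that was picked. Here is a minimal sketch that splits such a line into fields. The function name and the `looks_like_cpu_fallback` key are made up for illustration, and the "llvmpipe" check is only a heuristic, based on Mesa's software Vulkan driver reporting that name.

```python
import re

def parse_ggml_vulkan_device(line):
    """Split a 'ggml_vulkan: N = name | key: value | ...' log line into a dict.

    Returns None for lines that do not describe a device.
    """
    m = re.match(r"ggml_vulkan:\s*(\d+)\s*=\s*(.+)", line.strip())
    if m is None:
        return None
    head, *fields = [part.strip() for part in m.group(2).split("|")]
    info = {"index": int(m.group(1)), "name": head}
    for field in fields:
        key, _, value = field.partition(":")
        info[key.strip()] = value.strip()
    # Heuristic (assumption): a CPU-emulated Vulkan device from Mesa
    # identifies itself with "llvmpipe" in its name.
    info["looks_like_cpu_fallback"] = "llvmpipe" in head.lower()
    return info

# Device line copied from the llama-bench output above.
line = (
    "ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | "
    "fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none"
)
device = parse_ggml_vulkan_device(line)
print(device)
```

For the log above, the name field is the Mali G720 and the fallback flag is false, so this run did go to the real GPU. It also shows `int dot: 0` and `matrix cores: none` for this driver.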
So that’s interesting because while I’m getting the same PP performance on pure CPU with the same model, I’m getting twice that performance in TG using the CPU:
$ taskset -c 0,5-11 ./build/bin/llama-bench -t 8 -pg 128,128 -m models/qwen2.5-3b-instruct-q4_0.gguf
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| qwen2 3B Q4_0 | 1.86 GiB | 3.40 B | CPU | 8 | pp512 | 167.22 ± 0.02 |
| qwen2 3B Q4_0 | 1.86 GiB | 3.40 B | CPU | 8 | tg128 | 22.96 ± 0.05 |
| qwen2 3B Q4_0 | 1.86 GiB | 3.40 B | CPU | 8 | pp128+tg128 | 40.41 ± 0.01 |
build: 40be51152 (6458)
However I remember that in my previous tests with vulkan the results were much worse. Here it’s “only” twice as slow as the CPU.
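To put numbers on "twice as slow", here is a quick comparison of the two llama-bench tables above. Note that the two runs come from different llama.cpp builds (5288 vs. 6458) and report slightly different model sizes, so treat the ratios as indicative only.

```python
# Tokens per second copied from the two llama-bench tables above.
cpu = {"pp512": 167.22, "tg128": 22.96, "pp128+tg128": 40.41}
vulkan = {"pp512": 165.78, "tg128": 11.70, "pp128+tg128": 21.57}

# How many times faster the CPU run is than the Vulkan run, per test.
speedup = {test: cpu[test] / vulkan[test] for test in cpu}

for test, ratio in speedup.items():
    print(f"{test:12s} CPU is {ratio:.2f}x the Vulkan result")
```

Prompt processing is essentially a tie, while token generation and the mixed test are a bit under 2x in favour of the CPU.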
FWIW, I think I recall that tweaking the batch-size parameter for Vulkan had a big impact on PP performance. It was still slower than CPU, but there seemed to be a “sweet spot” for best performance.
Also note that shinoda sets -ngl 1000 so, maybe, that’s allocating memory in a way that the GPU can access it quicker? I think I played with that param too, but I can’t remember whether it impacted performance.
I also seem to remember that batch size matters a little. As for -ngl, it indicates the max number of layers you’re willing to offload to the GPU, so here it just means that all layers are handled by the GPU; it doesn’t affect anything else. I remember trying to offload only some layers (even just one), hoping to use both the CPU and GPU in parallel, but every time it was much worse, as if the CPU-GPU communication was dragging performance down.
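A toy illustration of the "-ngl is a maximum" point: the value is clamped to the number of layers the model has, so 1000 just means "everything". This is a simplified model, not llama.cpp's actual code; the function name is made up and the layer count of 36 is a placeholder, not a figure taken from this thread.

```python
def split_layers(n_layers, ngl):
    """Return (layers_on_cpu, layers_on_gpu) for a requested -ngl value.

    Simplified assumption: the request is clamped to the model's layer
    count and whatever is not offloaded stays on the CPU.
    """
    on_gpu = max(0, min(ngl, n_layers))
    return n_layers - on_gpu, on_gpu

N_LAYERS = 36  # placeholder layer count, for illustration only

full_offload = split_layers(N_LAYERS, 1000)  # -ngl 1000: everything on the GPU
single_layer = split_layers(N_LAYERS, 1)     # -ngl 1: one layer on the GPU
cpu_only = split_layers(N_LAYERS, 0)         # -ngl 0: nothing offloaded

print(full_offload, single_layer, cpu_only)
```

With a partial split, each token has to pass through both devices, which fits the observation that mixed CPU+GPU runs were slower than either one alone.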
Yes, I got the same: with Vulkan enabled but some layers still left on the CPU, the performance gets much worse.
And when I dug further, it seems that in this mode, when execution goes back to the CPU, it does not enable KleidiAI or any Arm acceleration at all. That could be the cause, but I still don’t have any clues about it.
Maybe llama.cpp got some new tricks for Armv9 to boost the perf?
I will run the latest to see if I can reproduce the results.
Unfortunately no, I haven’t seen any performance change since the Orion O6 was sent to developers a while ago. I simply think that SVE is fast enough to saturate the memory bandwidth and that the GPU doesn’t have enough cores to reach the same performance.
Hmm, but why did you get twice my TG perf?
I did manage to get KleidiAI to build, but it actually caused worse performance for me:
- I did manage a successful build with KleidiAI support - but prompt ingestion performance did not appear to improve and token generation was actually significantly slower (tg on Qwen3 A3B:30B:Q4_K_M down to around 12 t/s).
You’ll get about the same if you run purely on CPU. I’m building with:
cmake -B build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DBUILD_SHARED_LIBS=OFF -DGGML_OPENMP=OFF -DGGML_NATIVE=OFF -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS:STRING="-Ofast -DNDEBUG -mcpu=cortex-a720+sve+dotprod -pthread" -DCMAKE_C_FLAGS:STRING="-Ofast -DNDEBUG -mcpu=cortex-a720+sve+dotprod -pthread"
I tried with Vulkan and OpenCL, and on this machine the GPU only degrades performance compared to the CPU. Also don’t forget that the CPU has limited memory bandwidth. We suspected that the GPU and NPU would probably not be subject to the same limitation, since they are not connected to the DSU, but that doesn’t necessarily mean that they benefit from higher capabilities either. And memory performance is the #1 limiting factor for TG.
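A back-of-the-envelope way to see the memory argument: if generating one token requires reading all the model weights once, then model size times tokens per second gives the weight-read rate the run actually sustained. This is a rough sketch under that assumption only; it ignores the KV cache, activations and any cache reuse, and the function name is made up for illustration.

```python
def implied_read_rate_gib_s(model_size_gib, tokens_per_s):
    """Weight-read rate implied by a TG result, assuming every weight
    is read exactly once per generated token (a simplification)."""
    return model_size_gib * tokens_per_s

# Model sizes and tg128 results copied from the llama-bench tables above.
cpu_rate = implied_read_rate_gib_s(1.86, 22.96)
vulkan_rate = implied_read_rate_gib_s(1.69, 11.70)

print(f"CPU run:    ~{cpu_rate:.1f} GiB/s of weights read")
print(f"Vulkan run: ~{vulkan_rate:.1f} GiB/s of weights read")
```

Under this assumption the CPU run streams roughly 43 GiB/s of weights and the Vulkan run roughly 20 GiB/s, so the GPU path is not getting more out of memory than the CPU does here.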
How about some O6 results with the Mesa driver?
I don’t have an O6 yet but the GPU is architecturally similar to the Rock 5B’s G610.
Unfortunately, with the latest Mesa Panthor driver, vkpeak reports about 1/6th the GFLOPS of the ARM Mali driver on the G610.
80 vs. ~500 FP32 GFLOPS is quite a difference.