Llama.cpp - Benchmarks

Hi all,

I bought the Orion O6 primarily with the intent of running LLMs on it (I needed something with relatively low power consumption - I run exclusively on solar).

I figure some benchmarks for Llama.cpp might interest others. Do not expect great speeds - token generation is constrained by RAM bandwidth, and prompt ingestion is currently CPU-only (though I'd love to see how it would perform using the NPU).

A preliminary note: I built Llama.cpp using the following:

# NOTE: GGML_CPU_ARM_ARCH matters here - without it, Llama.cpp may not detect NEON/SIMD support on the Cix P1 (this was the case for the Radxa Debian image).
cmake -B build -DGGML_CPU_ARM_ARCH=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DGGML_NATIVE=off

# Build it (use 6 threads)
cmake --build build --config Release -j 6

The reason for this is that Llama.cpp's build script has some conditionals when determining the CPU feature set, and these weren't being detected correctly on Radxa's Debian image, so it was built without the ARM NEON kernels (which impacts prompt processing significantly). As for why that feature list is so long: I pretty much threw the kitchen sink at it in case other optimizations were possible (most are likely unused).
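
If you want to sanity-check that the kernel actually exposes those SIMD features, something like this helps (the flag names below are the standard Linux hwcap names, so treat the exact list as an assumption for your image):

# Print the CPU feature list reported by the kernel - look for asimd (NEON), asimddp (dotprod), i8mm and sve2
grep -m1 Features /proc/cpuinfo

llama.cpp also prints a system_info line when loading a model; NEON / DOTPROD / MATMUL_INT8 should show as 1 there if the build picked them up.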

Some other quick notes:

  1. I did manage a successful build with KleidiAI support, but prompt ingestion performance did not appear to improve and token generation was actually significantly slower (tg on Qwen3-30B-A3B Q4_K_M dropped to around 12 t/s).
  2. Building with Vulkan, technically, succeeded - but it freezes (or is ridiculously slow) while trying to ingest/generate tokens.
  3. 7 threads appears to be the sweet spot. Using -1 (auto) for threads, it was ridiculously slow (maybe it was trying to use small cores that aren't available?) - see the core-pinning sketch after this list.
  4. My Orion O6 is in the Radxa AI Kit case. I haven't yet added a copper plate to make better contact with the heatsink, so it runs a little hot (CPU B1 @ ~65C). Power consumption (according to my USB-C charger) is around 20W when running (20V, 1A).
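
On point 3, one thing worth trying is pinning llama-bench to the big cores explicitly. A minimal sketch - the core IDs below are purely illustrative, so check which cores belong to the big cluster on your image first:

# The big cores usually report a higher max frequency
cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq

# Pin the run to a specific set of cores (replace 4-11 with whatever the big-core IDs actually are)
taskset -c 4-11 ./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf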

Benchmarks

./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           pp512 |         23.31 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           tg128 |         16.13 ± 0.08 |

build: cb9178f8 (5857)
# NOTE: Too slow, so only did one repeat.
./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-32B-Q4_K_M.gguf -r 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CPU        |       7 |           pp512 |          3.90 ± 0.00 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CPU        |       7 |           tg128 |          1.95 ± 0.00 |

build: cb9178f8 (5857)
# NOTE: Too slow, so only did one repeat.
./llama-bench -t 7 -m ../../../models/Hunyuan-A13B-Instruct-Q4_K_M.gguf -r 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-moe A13B Q4_K - Medium |  45.43 GiB |    80.39 B | CPU        |       7 |           pp512 |          6.65 ± 0.00 |
| hunyuan-moe A13B Q4_K - Medium |  45.43 GiB |    80.39 B | CPU        |       7 |           tg128 |          3.71 ± 0.00 |

build: cb9178f8 (5857)

For anyone interested, here's an example of Qwen3-30B-A3B without NEON/SIMD.

# NOTE: Built WITHOUT NEON/SIMD
./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           pp512 |         14.85 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           tg128 |         12.41 ± 0.01 |

build: cb9178f8 (5857)

Just a small update on the status of Llama.cpp with Orion O6 + Vulkan:

Technically, Vulkan can "kind of" work, but it's far slower than the CPU backend.

I found that by tweaking:

  1. the -ngl parameter to zero (llama-bench defaults to 99), so that no layers are offloaded, and
  2. the -b(atch) parameter to a low number (e.g. 16),

I was able to run llama-bench successfully (example invocation below). Performance is far below CPU though. For example, Qwen3-32B sat at around 2 t/s for prompt processing and around 1.1 t/s for token generation, versus roughly 4 t/s and 2 t/s on the CPU.
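
For reference, the invocation that got the Vulkan build through a run looked roughly like this (same model paths as the CPU benchmarks above; treat it as a sketch rather than a tuned configuration):

# Vulkan build: keep every layer on the CPU (-ngl 0) and use a small batch size (-b 16)
./llama-bench -t 7 -ngl 0 -b 16 -r 1 -m ../../../models/Qwen_Qwen3-32B-Q4_K_M.gguf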

This was tested using the Radxa Debian Image, so perhaps there are system config tweaks that can improve performance. Maybe it’s related to the UMA architecture and how the GPU is allowed to access the shared memory?

So are you more satisfied using this quant?

I’m actually really impressed by the latest Qwen3-30B-A3B released on 2025-07-25:

https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

As for Q4_K_M, I went with that quant because it seems to be a good trade-off between perplexity loss and performance. In my experience, for simple queries, Q4 is typically good enough.
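
In case it helps anyone trying that release, this is roughly how I'd pull the Q4_K_M file (the file name pattern is a guess based on bartowski's usual naming - double-check on the repo page):

# Download just the Q4_K_M GGUF from the repo linked above
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF --include "*Q4_K_M*" --local-dir ../../../models

# Then benchmark it the same way as before (adjust the file name to whatever was actually downloaded)
./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf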

If we go for larger quants:

  1. It would require more RAM.
  2. Token generation would be slower (e.g. a Q8 would probably halve token generation performance due to the RAM bandwidth constraints on the O6 - rough back-of-envelope below).
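
As a rough back-of-envelope on point 2 (assumed numbers, not measurements - the bandwidth figure is just a placeholder chosen so the Q4_K_M estimate lands near the ~16 t/s measured above):

# tg is roughly capped at memory_bandwidth / bytes_read_per_token;
# the 30B-A3B MoE only reads its ~3B active parameters per generated token.
awk 'BEGIN {
  bw = 30e9;       # assumed effective memory bandwidth in bytes/s (placeholder)
  active = 3e9;    # ~3B active parameters per token
  print "Q4_K_M upper bound:", bw / (active * 0.61), "t/s";  # 17.35 GiB / 30.53B params ~ 0.61 bytes/weight
  print "Q8_0   upper bound:", bw / (active * 1.06), "t/s";  # Q8_0 is 8.5 bits/weight
}'

Q8_0 reads roughly 1.7x the bytes per weight here, so token generation drops accordingly at the same bandwidth.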

That said, the NPU natively supports INT8, so it might be worth using a Q8 if the NPU ever becomes supported (it would likely speed up prompt processing a lot).

There is also an ALIGNED_INT4 type in the NPU source code, so INT4 might be supported too (which would allow efficient prompt processing of Q4 quants). I'm not confident about this though - if someone is able to verify it, I'd appreciate it.

I don’t think the NPU SDK is in a very good state yet though. See the post here for details: C++ Example running YOLOv8 on the NPU

It might be a very long time before we’re able to use it with Llama (if ever).
