Llama.cpp - Benchmarks

Hi all,

I bought the Orion O6 primarily with the intent of running LLMs on it (I needed something with relatively low power consumption - I run exclusively on solar).

I figure some benchmarks for Llama.cpp might interest others. Do not expect great speeds: token generation is constrained by RAM bandwidth, and prompt ingestion is currently CPU-only (though I would love to see how it performs using the NPU).

A preliminary note: I built Llama.cpp using the following:

# NOTE: GGML_CPU_ARM_ARCH is important - without it, llama.cpp may not detect NEON/SIMD support on the Cix P1 (this was the case with the Radxa Debian image).
cmake -B build -DGGML_CPU_ARM_ARCH=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DGGML_NATIVE=off

# Build it (use 6 threads)
cmake --build build --config Release -j 6

The reason is that llama.cpp’s build script has some conditionals for determining the CPU feature set, and these weren’t detected correctly on Radxa’s Debian image, so the build was missing the ARM NEON kernels (which hurts prompt processing significantly). As for why that feature list is so long: I pretty much threw the kitchen sink at it in case other optimizations were possible (most are likely unused).
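
If you want to sanity-check that the NEON/SIMD kernels actually made it into a build, something along these lines should work (the model path is just an example, and the exact feature-flag names reported by the kernel may vary slightly):

# Check which SIMD-related features the kernel reports for the CPU
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E 'asimd|sve|i8mm|asimddp'

# llama.cpp prints a "system_info" line (e.g. NEON = 1, SVE = 1) when it loads a model;
# a one-token run is enough to see it.
./build/bin/llama-cli -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -p "hi" -n 1 2>&1 | grep system_info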

Some other quick notes:

  1. I did manage a successful build with KleidiAI support - but prompt ingestion performance did not appear to improve, and token generation was actually significantly slower (tg on Qwen3-30B-A3B Q4_K_M down to around 12 t/s).
  2. Building with Vulkan, technically, succeeded - but it freezes (or is ridiculously slow) while trying to ingest/generate tokens.
  3. 7 threads appears to be the sweet spot. Using -1 (auto), it was ridiculously slow (maybe trying to use small cores which aren’t available?) - see the core-pinning sketch after this list.
  4. My Orion O6 is in the Radxa AI Kit case. I haven’t placed a copper plate to make better contact with the heatsink yet, so it runs a little hot (CPU B1 @ ~65C). Power consumption (according to my USB-C charger) is around 20W when running (20V, 1A).
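
On point 3, if the scheduler was spreading threads onto the little cores, explicitly pinning the benchmark to the bigger cores might help. A rough sketch - the 4-11 range is only an assumption about the Cix P1's core numbering, so check lscpu -e for the actual layout first:

# Show which CPU numbers map to which cores (look at the MAXMHZ column)
lscpu -e

# Pin llama-bench to an assumed set of medium/big cores (adjust the range as needed)
taskset -c 4-11 ./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf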

Benchmarks

./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           pp512 |         23.31 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           tg128 |         16.13 ± 0.08 |

build: cb9178f8 (5857)
# NOTE: Too slow, so only did one repeat.
./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-32B-Q4_K_M.gguf -r 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CPU        |       7 |           pp512 |          3.90 ± 0.00 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CPU        |       7 |           tg128 |          1.95 ± 0.00 |

build: cb9178f8 (5857)
# NOTE: Too slow, so only did one repeat.
./llama-bench -t 7 -m ../../../models/Hunyuan-A13B-Instruct-Q4_K_M.gguf -r 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-moe A13B Q4_K - Medium |  45.43 GiB |    80.39 B | CPU        |       7 |           pp512 |          6.65 ± 0.00 |
| hunyuan-moe A13B Q4_K - Medium |  45.43 GiB |    80.39 B | CPU        |       7 |           tg128 |          3.71 ± 0.00 |

build: cb9178f8 (5857)

For anyone interested, here’s an example of Qwen3-30B-A3B without NEON/SIMD.

# NOTE: Built WITHOUT NEON/SIMD
./llama-bench -t 7 -m ../../../models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           pp512 |         14.85 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           tg128 |         12.41 ± 0.01 |

build: cb9178f8 (5857)

Just a small update on the status of Llama.cpp with Orion O6 + Vulkan:

Technically, Vulkan can “kind of” work. But it’s far slower than CPU.

I found that by tweaking:

  1. The -ngl parameter to zero (llama-bench defaults to 99) so that no layers are offloaded and…
  2. The -b(atch) parameter to a low number (e.g. 16)

I was able to run llama-bench successfully (an example invocation is sketched below). Performance is far below CPU though. For example, Qwen3-32B sat at around 2 t/s for prompt processing and around 1.1 t/s for token generation, versus roughly 4 t/s and 2 t/s on the CPU.
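
Roughly this kind of invocation, as a sketch (the model path is just an example):

# Vulkan build: keep all layers on the CPU (-ngl 0) and use a small batch size (-b 16)
./llama-bench -m ../../../models/Qwen_Qwen3-32B-Q4_K_M.gguf -t 7 -ngl 0 -b 16 -r 1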

This was tested using the Radxa Debian Image, so perhaps there are system config tweaks that can improve performance. Maybe it’s related to the UMA architecture and how the GPU is allowed to access the shared memory?

So are you more satisfied using this quant?

I’m actually really impressed by the latest Qwen3-30B-A3B released on 2025-07-25:

https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

As for Q4_K_M, I went with that quant because it seems to be a good trade-off between perplexity loss and performance. In my experience, for simple queries, Q4 is typically good enough.

If we go for larger quants:

  1. It would require more RAM.
  2. Token generation would be slower (e.g. a Q8 would probably halve token generation performance due to RAM bandwidth constraints on the O6 - see the rough numbers sketched after this list).
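
A back-of-the-envelope sketch of point 2, assuming roughly 3.3B active parameters touched per token for the A3B MoE and around 40 GB/s of usable RAM bandwidth (both rough assumptions):

# Token generation is roughly bandwidth-bound: t/s ceiling ~= RAM bandwidth / bytes read per token
awk 'BEGIN {
  bw  = 40e9    # assumed usable RAM bandwidth, bytes/s
  act = 3.3e9   # assumed active parameters read per token
  printf "Q4_K_M (~4.9 bits/weight): ~%.0f t/s ceiling\n", bw / (act * 4.9 / 8)
  printf "Q8_0   (~8.5 bits/weight): ~%.0f t/s ceiling\n", bw / (act * 8.5 / 8)
}'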

That said, the NPU natively supports INT8, so it might be worth using a Q8 if the NPU ever becomes supported (would likely speed up Prompt Processing a lot).

There is also an ALIGNED_INT4 type in the NPU source code, so it might be that INT4 is supported too (for efficient Prompt Processing of Q4 quants). I’m not confident about this though - if someone is able to verify whether that’s the case, I’d appreciate it.

I don’t think the NPU SDK is in a very good state yet though. See the post here for details: C++ Example running YOLOv8 on the NPU

It might be a very long time before we’re able to use it with Llama (if ever).


Decided to try out ik_llama.cpp, as I’d heard it has some CPU optimizations (particularly for Prompt Processing). The improvement is really dramatic for PP: on Qwen3-30B-A3B-Instruct-2507 Q4_K_M, it pretty much doubles the t/s.

How to Build

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp

# THIS IS IMPORTANT! MUST SET ARCH FLAGS MANUALLY!
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DCMAKE_C_FLAGS="-march=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth" -DCMAKE_CXX_FLAGS="-march=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth" -DGGML_NATIVE=off

# Build using 7 threads
cmake --build ./build --config Release -j 7

Benchmarks

ik_llama.cpp:

./llama-bench -m ../../../models/Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf -t 7       
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       7 |         pp512 |     50.51 ± 0.27 |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       7 |         tg128 |     15.44 ± 0.01 |

build: d99cf7cb (3836)

llama.cpp:

./llama-bench -m ../../../models/Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf -t 7
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           pp512 |         24.33 ± 0.06 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       7 |           tg128 |         14.94 ± 0.03 |

build: cb9178f8 (5857)

For some reason, I’m getting ~10% lower results for ik_llama.cpp.

sh-5.2$ build/bin/llama-bench -m ../Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf -t 7
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       7 |         pp512 |     45.76 ± 0.85 |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       7 |         tg128 |     13.79 ± 0.75 |

build: e082df47 (3837)

Given a 64GB O6 board, what would be the current best local model(s) to run for coding? I would like to run aider.chat or another CLI tool on one machine and use the O6 as an LLM server. Aider supports separate “architect” and “coder” models, so it would be possible to run two 30B models (4-bit) in parallel.
I’m not an AI expert, but would love to use the available local models.

For some reason, I’m getting ~10% lower results for ik_llama.cpp

Did you use this build command:

cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DCMAKE_C_FLAGS="-march=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth" -DCMAKE_CXX_FLAGS="-march=armv9-a+sve2+dotprod+i8mm+fp16+fp16fml+crypto+sha2+sha3+sm4+rcpc+lse+crc+aes+memtag+sb+ssbs+predres+pauth" -DGGML_NATIVE=off

I’m not sure ik_llama will correctly detect and optimize for the ARM architecture otherwise (it was missing NEON and SVE, IIRC). It might also be something to do with clock frequency - I haven’t checked what mine is running at (but I had to downgrade to an older BIOS version).

Given a 64GB O6 board, what would be the current best local model(s) to run for coding?

Might want to check this one out:

https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct

I really think anything beyond 3B active parameters is probably heading into impractical-speed territory: the RAM bandwidth is just too slow for token generation on bigger models.

Note that the CPU optimizations don’t provide that large a gain (with default llama.cpp). I’ve just run llama-bench below with -mcpu=native and with -mcpu=cortex-a720+sve+dotprod, which I normally use on this machine; the difference is around 1.5%. I’ve tested your flags and they give me exactly the same results as these options, and the executable is the same size as well, indicating that the compiler didn’t make use of any other features:

build: 4227c9be (6170)
-mcpu=native first:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_1          |  17.89 GiB |    30.53 B | CPU        |       8 |           pp512 |         30.48 ± 0.01 |
| qwen3moe 30B.A3B Q4_1          |  17.89 GiB |    30.53 B | CPU        |       8 |           tg128 |         19.90 ± 0.01 |

And -mcpu=cortex-a720+sve+dotprod below:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_1          |  17.89 GiB |    30.53 B | CPU        |       8 |           pp512 |         30.97 ± 0.00 |
| qwen3moe 30B.A3B Q4_1          |  17.89 GiB |    30.53 B | CPU        |       8 |           tg128 |         20.30 ± 0.01 |
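
For anyone wanting to try those flags, they can be passed into the llama.cpp build roughly like this (a sketch - compiler choice and the rest of the configuration are up to you):

# Build with an explicit -mcpu instead of GGML_CPU_ARM_ARCH
cmake -B build -DGGML_NATIVE=off \
      -DCMAKE_C_FLAGS="-mcpu=cortex-a720+sve+dotprod" \
      -DCMAKE_CXX_FLAGS="-mcpu=cortex-a720+sve+dotprod"
cmake --build build --config Release -j 8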

I didn’t know about ik_llama, I’ll have to have a look, thanks for the hint!

I just ran ik_llama on the same model and it gave me an impressive increase in PP (2.46x) but a 20% decrease in TG:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe ?B Q4_1               |  17.89 GiB |    30.53 B | CPU        |       8 |           pp512 |         76.34 ± 0.33 |
| qwen3moe ?B Q4_1               |  17.89 GiB |    30.53 B | CPU        |       8 |           tg128 |         16.20 ± 0.06 |

Maybe it changes depending on the quantization. In any case, the much faster processing is interesting when ingesting large documents or parts of code.

Could’ve sworn that when I was testing, it made a significant difference (~25%) due to the NEON support. Maybe I’m misremembering (or maybe a newer commit already detects NEON support correctly)?

I just ran ik_llama on the same model and it gave me an impressive increase of PP (x2.46) but a 20% decrease in TG:

Really impressed by your figures here. Have you overclocked your board? I feel like I remember someone (may have been you?) talking about that in another thread. Unsure why your TG is a bit slower; on mine, it seemed to tick up very slightly. Either way, huge improvement. Really hoping we’ll eventually get good NPU support on the Orion for stuff like that. There’s an ALIGNED_INT4 type mentioned in the C code, so it might be theoretically possible to support Q4 quants.

Maybe initially you didn’t use -mcpu=native and gcc didn’t enable it by default? Note also that I’m using gcc 14.2; maybe an earlier version doesn’t enable it? I remember that with the initial gcc (12 or 13, I don’t remember which), -mcpu=native didn’t properly detect the CPU’s features.
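
One quick way to check what a given gcc actually does with -mcpu=native (a sketch):

# The cc1 line in the verbose output shows how -mcpu=native was resolved,
# including which feature modifiers (sve2, i8mm, dotprod, ...) were detected.
gcc -mcpu=native -E -v - </dev/null 2>&1 | grep cc1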

Yep. I have 2x3.0, 2x2.7 and 2x2.6 GHz. It’s not much compared to the initially advertised frequencies :wink: After all, it’s only the medium cores running 8% higher. My RAM is set to 6400 MT/s (though that doesn’t change anything versus the default 5500), and the internal coherent interconnect bus (CI-700) was raised from 1500 to 1700 MHz, which does help make better use of the RAM bandwidth: my RAM BW jumped from 40 to 46 GB/s. Of course we’re still far from the initially expected 100, but that is due to internal limitations in the chip. My understanding is that the DSU (which contains L3), capped at 1.3 GHz, limits communications with the L3 to 1.3 GHz x 4 x 256 bits = ~166 GB/s, so it should not be an issue. However, it’s connected to the CI-700 by a single 256-bit bus, and it’s not clear to me whether that link is subject to the 1.3 GHz cap or not. Apparently not, since I’m getting 46 GB/s while that would otherwise limit it to 41. The NPU/GPU do not pass through the DSU and are directly connected to the CI-700, so the RAM BW is shared between the DSU and the NPU+GPU.

My understanding of the architecture (but I could be wrong) is the following:
CPU clusters --(4x256b)--> DSU-120 @ 1.3 GHz (L3) --(1x256b)--> CI-700 @ 1.5 GHz --(128b x 6400 MT/s)--> LPDDR5.

Thus we’d have a total of ~166 GB/s to L3 (which seems to roughly match my observations) but a bottleneck of 41 or 48 GB/s to the CI-700 and thus to RAM. It’s not entirely clear to me how other chips deal with this, because the DSU-CI link is 256 bits and the CI-700 tops out at 2 GHz on 5nm chips (i.e. 64 GB/s). Maybe larger chips implement multiple CIs when they have multiple memory channels, I don’t know. Or maybe it’s possible to connect multiple DSUs to a single CI.
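
To make the numbers above explicit (rough figures, following the assumptions in this post):

# DSU <-> L3:     1.3 GHz x 4 links x 256 bits = 1.3e9 x 4 x 32 B ~ 166 GB/s
# DSU <-> CI-700: 1.3 GHz x 1 link  x 256 bits = 1.3e9 x 32 B     ~  42 GB/s
#                 (or ~48 GB/s if that link runs at the CI-700's 1.5 GHz instead)
# CI-700 <-> RAM: 6400 MT/s x 128 bits          = 6.4e9 x 16 B     ~ 102 GB/s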

This chip remains a bit awkward, with internal bottlenecks that are lower than the external ones. However, we need to keep in mind that it’s built around medium cores and targets a balance between efficiency, performance and cost. For example, it still achieves higher bandwidth than a single DRAM channel would, so even if it hits certain limitations, these stem from a set of reasonable design choices that also consider cost and power usage.


I’ve copy-pasted your build commands directly into my shell.


BTW, I’ve retested on Q4_K_M to make sure we’re comparing apples to apples (I was on Q4_1 before), and PP is even slightly faster with llama.cpp:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       8 |           pp512 |         34.02 ± 0.01 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | CPU        |       8 |           tg128 |         21.09 ± 0.00 |

With ik-llama, it’s slightly slower than it was on Q4_1, however:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       8 |           pp512 |         75.37 ± 0.38 |
| qwen3moe ?B Q4_K - Medium      |  17.35 GiB |    30.53 B | CPU        |       8 |           tg128 |         14.75 ± 0.20 |

ik-llama isn’t super stable for me. Some models fail to load or cause random crashes, but there’s probably some good stuff to pick there for the prompt processing.


Thanks for the detail here. Given the architecture you suspect for the Cix P1, do you know if this might bode well for the NPU? It sounds like you might be suggesting that the NPU bypasses L3 and has direct access to RAM via the CI-700 (maybe I’ve misinterpreted, though)? Any idea whether any other constraints have been discovered relating to the NPU (e.g. I remember the RK3588 NPU only being able to access memory within the first 4GB range, or something to that effect)?

ik-llama isn’t super stable for me. Some models fail to load or cause random crashes, but there’s probably some good stuff to pick there for the prompt processing.

I haven’t had any ik-llama-exclusive problems yet, but I do seem to get some hard crashes while benching larger models (70B) on both ik-llama.cpp and llama.cpp. I haven’t actually run a memtest yet to see whether something might be faulty there, though. If relevant, I’m using the Radxa Debian Image.

I don’t think the NPU or the GPU will have any issues, as they’re connected outside of the DSU (DynamIQ Shared Unit) and thus closer to the memory controller. For me the limitation really is between the DSU and the CI-700. Thus, if the NPU ever becomes generally supported, we could imagine doubling the token generation rate. On the other hand, with NPUs you have fewer choices of quantization: it will be either int4 or int8, without the optimal Q4_K_M or Q5_K_M quants which provide great compromises between performance, memory usage and quality. That means if Q4_0 is too low quality for you, you’ll need to switch to Q8_0, which will effectively consume twice the memory bandwidth and use up all that was gained by bypassing the congested buses. Still, at least it could improve quality at the same speed.


Thanks for all the detail, really appreciate it. Really hoping we get some quality NPU libraries soon so that we can see what the potential performance might actually be like (unfortunately, the Cix repos are still in pretty poor shape, and the LLM demos don’t make use of the NPU at all as far as I can tell).
