Llama.cpp - Benchmarks

I’m actually really impressed by the latest Qwen3-30B-A3B released on 2025-07-25:

https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

In terms of Q4_K_M, I went with that quant because it seems to be a good trade-off in terms of perplexity loss and performance. In my experience, for simple queries, Q4 is typically good enough.

If we go for larger quants:

  1. It would require more RAM.
  2. Token generation would be slower (e.g. a Q8 would probably halve token generation performance due to RAM bandwidth constraints on the O6).

That said, the NPU natively supports INT8, so it might be worth using a Q8 if the NPU ever becomes supported (would likely speed up Prompt Processing a lot).

There is also an ALIGNED_INT4 type in the NPU source-code, so it might be that INT4 is supported too (for efficient Prompt Processing of Q4 quants). I’m not confident on this though - if someone might be able to verify that’s the case, would appreciate it.

I don’t think the NPU SDK is in a very good state yet though. See the post here for details: C++ Example running YOLOv8 on the NPU

It might be a very long time before we’re able to use it with Llama (if ever).

1 Like