I’m actually really impressed by the latest Qwen3-30B-A3B released on 2025-07-25:
https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF
I went with Q4_K_M because it seems to be a good trade-off between perplexity loss and performance. In my experience, Q4 is typically good enough for simple queries.
If we go for larger quants:
- It would require more RAM.
- Token generation would be slower (e.g. a Q8 would probably halve token generation performance due to RAM bandwidth constraints on the O6).
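To put rough numbers on both points, here is a back-of-the-envelope sketch. It assumes token generation is purely memory-bandwidth bound, so each generated token has to read every active weight once. The bits-per-weight figures (about 4.85 for Q4_K_M, 8.5 for Q8_0) are approximate averages, the parameter counts are rounded from the model name (30B total, 3B active), and the bandwidth value is a hypothetical placeholder rather than a measured figure for the O6. Treat the output as an estimate only.

```python
# Rough estimate only: assumes token generation is limited purely by
# how fast the active weights can be streamed from RAM.

TOTAL_PARAMS = 30e9    # rounded from the "30B" in the model name
ACTIVE_PARAMS = 3e9    # rounded from "A3B" (active parameters per token)
BANDWIDTH_GB_S = 50.0  # hypothetical placeholder, NOT a measured O6 figure

# Approximate average bits per weight for each quant type (assumption).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def model_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in RAM, in GB."""
    return total_params * bits_per_weight / 8 / 1e9


def est_tokens_per_sec(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


sizes = {q: model_size_gb(TOTAL_PARAMS, b) for q, b in BITS_PER_WEIGHT.items()}
speeds = {q: est_tokens_per_sec(ACTIVE_PARAMS, b, BANDWIDTH_GB_S)
          for q, b in BITS_PER_WEIGHT.items()}
# Relative speed of Q8_0 vs Q4_K_M; independent of the bandwidth placeholder.
q8_vs_q4 = speeds["Q8_0"] / speeds["Q4_K_M"]

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{sizes[quant]:.1f} GB, ~{speeds[quant]:.1f} tok/s upper bound")
print(f"Q8_0 speed relative to Q4_K_M: {q8_vs_q4:.2f}x")
```

The ratio at the end does not depend on the bandwidth placeholder, only on the bits per weight, which is why a Q8 ending up at roughly half the speed of a Q4 seems plausible under these assumptions.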
That said, the NPU natively supports INT8, so it might be worth using a Q8 if the NPU ever becomes supported (would likely speed up Prompt Processing a lot).
There is also an ALIGNED_INT4 type in the NPU source code, so INT4 might be supported too (for efficient Prompt Processing of Q4 quants). I’m not confident about this though. If someone is able to verify it, I’d appreciate it.
I don’t think the NPU SDK is in a very good state yet though. See the post here for details: C++ Example running YOLOv8 on the NPU
It might be a very long time before we’re able to use it with Llama (if ever).