NPU Support for llama.cpp, Ollama, ONNX, etc.

Hi,

We have been trying to use the NPU with llama.cpp, Ollama, and ONNX.

With llama.cpp and Ollama, inference runs only on the CPU; there seems to be no option for using the NPU with these frameworks.

For ONNX Runtime, there is a “ZhouyiExecutionProvider” for loading and running models on the NPU, but it complains about missing libraries and in the end does not work.
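
For reference, this is roughly how we tried to select that provider. It is only a minimal sketch: whether “ZhouyiExecutionProvider” is actually registered depends on the onnxruntime build CIX ships, and the model path is just a placeholder.

```python
import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build;
# the NPU provider would have to show up here for it to be usable.
print(ort.get_available_providers())

# Ask for the NPU provider first, with CPU as fallback.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path, not a specific Model Hub model
    providers=["ZhouyiExecutionProvider", "CPUExecutionProvider"],
)

# Shows which providers the session actually ended up using; if only
# CPUExecutionProvider appears here, the model is not running on the NPU.
print(session.get_providers())
```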

To us it looks like the NPU can only be used with NPU-optimized models from the CIX AI Model Hub, via custom Python scripts like inference_npu.py.

Do these observations seem right?

Thanks

I have not found anything practical supporting the NPU either. Given that for LLMs the limiting factor will be DRAM bandwidth, I simply gave up searching. The benefit of the NPU could be during data ingestion and/or image analysis, but for token generation I strongly doubt we’d gain anything. The only gain I have seen so far is 10%, obtained by increasing the DRAM bandwidth by… 10%!
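
A back-of-the-envelope calculation shows why: during token generation essentially all of the weights have to be streamed from DRAM once per token, so bandwidth sets a hard ceiling regardless of which compute unit does the math. The numbers below are illustrative assumptions, not measurements from the board:

```python
# Rough bandwidth-bound ceiling on token generation speed.
model_size_gb = 4.0        # assumed: a ~7B model quantized to ~4 bits per weight
dram_bandwidth_gbs = 50.0  # assumed effective DRAM bandwidth

max_tokens_per_s = dram_bandwidth_gbs / model_size_gb
print(f"bandwidth-bound ceiling: ~{max_tokens_per_s:.1f} tokens/s")

# A 10% faster DRAM raises this ceiling by the same 10%, which matches the
# observation above; a faster NPU alone would not move it.
```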


GAIA, found here:

https://www.amd.com/en/developer/resources/technical-articles/gaia-an-open-source-project-from-amd-for-running-local-llms-on-ryzen-ai.html

Allows running LLMs on the NPU and the GPU at the same time.