With models like whisper.cpp and llama.cpp already delivering amazing results on CPU alone (voice recognition and a ChatGPT-like model respectively, both running, if slowly, on a Raspberry Pi 4), the obvious next step would be accelerating the rate-limiting multiply-accumulate (MAC) calculations on the NPU.
Both models are already Int8 quantised, which makes them a good fit for the NPU. However, to make this work the NPU would need to be driven as a pure MAC accelerator; as I understand it, it can do over 1000 multiply-accumulates per clock cycle.
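To make concrete what would be offloaded: the hot path in these models is essentially an int8 matrix-vector multiply-accumulate. Here is a minimal reference loop in C, with names and quantisation layout chosen for illustration rather than taken from either project; every iteration of the inner loop is one of the MACs the NPU could perform in parallel:

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative int8 matrix-vector multiply-accumulate:
// y[r] = scale * sum_k A[r][k] * x[k], with A and x quantised to int8.
// (Layout and scaling are simplified; real ggml kernels use per-block scales.)
void mac_i8_ref(const int8_t *A, const int8_t *x, float *y,
                size_t rows, size_t cols, float scale)
{
    for (size_t r = 0; r < rows; r++) {
        int32_t acc = 0;                                          // 32-bit accumulator avoids overflow
        for (size_t k = 0; k < cols; k++) {
            acc += (int32_t)A[r * cols + k] * (int32_t)x[k];      // one multiply-accumulate per element
        }
        y[r] = scale * (float)acc;                                // dequantise the accumulated result
    }
}
```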
As an example, here’s news of llama running on a smartphone, generating about a word per second on the CPU alone:
The models are too big (>7 GB) to simply convert with RKNN, but I think if Radxa could help with the implementation, it would be a huge deal!
Here’s a description of where NPU support could be implemented, from the whisper.cpp Roadmap FAQ:
One of the main goals of this implementation is to be very minimalistic and be able to run it on a large spectrum of hardware. The existing CPU-only implementation achieves this goal - it is bloat-free and very simple. I think it also has some educational value. Of course, not taking advantage of modern GPU hardware is a huge drawback in terms of performance. However, adding a dependency on a certain GPU framework will tie the project with the corresponding hardware and will introduce some extra complexity. With that said, adding GPU support to the project is low priority.
In any case, it would not be too difficult to add initial support. The main thing that needs to be offloaded to the GPU is the `GGML_OP_MUL_MAT` operator:
[whisper.cpp/ggml.c](https://github.com/ggerganov/whisper.cpp/blob/c71363f14cb0d31efe5c74f148b268da649050d9/ggml.c#L6231-L6234)
Lines 6231 to 6234 in [c71363f](https://github.com/ggerganov/whisper.cpp/commit/c71363f14cb0d31efe5c74f148b268da649050d9)
```c
case GGML_OP_MUL_MAT:
    {
        ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
    } break;
```
This is where more than 90% of the computation time is currently spent. Also, I don’t think it’s necessary to offload the entire model to the GPU. For example, the 2 convolution layers at the start of the Encoder can easily remain on the CPU as they are not very computationally heavy. Not uploading the full model to VRAM will make it require less memory and thus make it compatible with more video cards.
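For concreteness, here is a minimal sketch of what that offload hook could look like inside ggml.c. The tensor and parameter types and the existing CPU kernel come from the snippet above; `npu_is_available()`, `npu_can_handle()` and `npu_mul_mat_i8()` are hypothetical placeholders for whatever int8 matmul entry point the RKNN/NPU stack could expose, so this illustrates the idea rather than working code:

```c
// Sketch only (not existing whisper.cpp code): route the hot matmul to the
// NPU when possible, otherwise fall back to the stock CPU kernel.
// npu_is_available(), npu_can_handle() and npu_mul_mat_i8() are hypothetical
// placeholders for an NPU-provided int8 matrix multiply.
static void ggml_compute_forward_mul_mat_npu(
        const struct ggml_compute_params *params,
        const struct ggml_tensor *src0,
        const struct ggml_tensor *src1,
        struct ggml_tensor *dst) {
    if (npu_is_available() && npu_can_handle(src0, src1)) {
        npu_mul_mat_i8(params, src0, src1, dst);                 // offload the MAC-heavy multiply
    } else {
        ggml_compute_forward_mul_mat(params, src0, src1, dst);   // unchanged CPU path
    }
}
```

The existing `case GGML_OP_MUL_MAT:` branch would then call this wrapper instead of `ggml_compute_forward_mul_mat` directly, leaving everything else (including the Encoder’s convolution layers, as the FAQ notes) on the CPU.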