Rock5B NPU low level access for amazing AI?

With models like whisper.cpp and llama.cpp running on CPU with amazing results (voice recognition and a ChatGPT-like model, both running slowly on a Raspberry Pi 4), the obvious next step would be accelerating the rate-limiting multiply-accumulate (MAC) calculations on the NPU.

Both models are already Int8-quantised, so they are a perfect fit for the NPU. However, to make this work, the NPU would have to be usable as a pure MAC accelerator; as I understand it, it can do over 1000 multiply-accumulates per clock cycle.

As an example, here’s news of llama running on a smartphone, generating about a word per second on the CPU alone:

The models are too big (>7 GB) to just convert using RKNN, but I think if Radxa could help with the implementation, it would be a huge deal!

Here’s a description of where NPU support could be implemented, from the whisper.cpp Roadmap FAQ:

One of the main goals of this implementation is to be very minimalistic and be able to run it on a large spectrum of hardware. The existing CPU-only implementation achieves this goal - it is bloat-free and very simple. I think it also has some educational value. Of course, not taking advantage of modern GPU hardware is a huge drawback in terms of performance. However, adding a dependency on a certain GPU framework will tie the project with the corresponding hardware and will introduce some extra complexity. With that said, adding GPU support to the project is low priority.
In any case, it would not be too difficult to add initial support. The main thing that needs to be offloaded to the GPU is the GGML_OP_MUL_MAT operator:

[whisper.cpp/ggml.c](https://github.com/ggerganov/whisper.cpp/blob/c71363f14cb0d31efe5c74f148b268da649050d9/ggml.c#L6231-L6234)

Lines 6231 to 6234 in [c71363f](https://github.com/ggerganov/whisper.cpp/commit/c71363f14cb0d31efe5c74f148b268da649050d9)

    case GGML_OP_MUL_MAT:
        {
            ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
        } break;

This is where more than 90% of the computation time is currently spent. Also, I don’t think it’s necessary to offload the entire model to the GPU. For example, the 2 convolution layers at the start of the Encoder can easily remain on the CPU as they are not very computationally heavy. Not uploading the full model to VRAM will make it require less memory and thus make it compatible with more video cards.
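To make the idea concrete, here is a minimal sketch of what that hook could look like in the switch statement quoted above. `npu_mul_mat_supported()` and `npu_mul_mat()` are hypothetical placeholders for a wrapper around whatever NPU matmul interface becomes available; they don't exist in ggml or the Rockchip SDK:

```c
// Hypothetical sketch only: npu_mul_mat_supported() and npu_mul_mat() stand in
// for a wrapper around the NPU's matmul interface; they are not real functions.
case GGML_OP_MUL_MAT:
    {
        if (npu_mul_mat_supported(tensor->src0, tensor->src1)) {
            // offload the quantised-weights x activations product to the NPU
            npu_mul_mat(params, tensor->src0, tensor->src1, tensor);
        } else {
            // keep the existing CPU path as a fallback
            ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
        }
    } break;
```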

I think we should be asking Rockchip, as they’re the ones responsible for the SDKs.

We should ask Radxa to ask Rockchip!

We are all buying Radxa products for the performance and the NPU. They would have a better chance of getting documentation and an API for the NPU. It would add a huge amount of value to be able to run language models on a Rock 5B.

Ooh, not that model; I had something more ambitious in mind, basically ChatGPT on the Rock 5B!

Yes, it already runs on the Rock 5B (the 7B-parameter model at least). What’s interesting, though, is that it has been fine-tuned and re-released as the 7B-parameter Alpaca model, with benchmarks similar to InstructGPT’s!

ChatGPT and GPT-3 are matched by the leaked LLaMA large language model, and the code has been ported to C++:

It runs at a word per second or so on a Raspberry Pi 4. If it were accelerated by the NPU, you could have something like ChatGPT running on the Rock 5B. Roughly 90% of the calculation time in the C++ code lives in that one matmul call. If that could be offloaded to the NPU, we could get up to a 10X boost in performance and a usable offline digital assistant.
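For what it’s worth, the ~10X figure is just Amdahl’s law with roughly 90% of the time in the matmul: even if the NPU made that part essentially free, the remaining CPU work caps the overall gain at about 10X.

```latex
% p = fraction of runtime spent in the matmul, s = NPU speedup on that part
\[
  \text{speedup} = \frac{1}{(1 - p) + p/s},
  \qquad p = 0.9,\; s \to \infty
  \;\Longrightarrow\; \text{speedup} \to \frac{1}{0.1} = 10
\]
```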


What’s the status of this? I use text-generation-webui on my desktop, but would be cool to see the NPU being used for LLMs.

I can see there is a separate header file in the Rockchip NPU SDK at https://github.com/rockchip-linux/rknpu2/blob/master/runtime/RK3588/Linux/librknn_api/include/rknn_matmul_api.h, which contains an API that does matrix multiplication.
So I guess it’s possible now after all?
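Based on my reading of that header and the matmul demo in the same repo, the flow for a single int8 C = A × B looks roughly like the sketch below. The function names (`rknn_matmul_create`, `rknn_matmul_set_io_mem`, `rknn_matmul_run`, `rknn_matmul_destroy`, `rknn_create_mem`, `rknn_destroy_mem`) are from the SDK, but the exact struct fields and layout requirements vary between SDK versions, so treat this as an outline to check against the header rather than drop-in code:

```c
// Rough sketch of the RK3588 matmul API flow, based on my reading of
// rknn_matmul_api.h and the matmul demo in the rknpu2 repo. Exact struct
// fields and layout handling differ between SDK versions - check the header.
#include <stdint.h>
#include <string.h>
#include "rknn_matmul_api.h"

// Multiply an MxK int8 matrix A by a KxN int8 matrix B on the NPU,
// writing the int32 result into C. Assumes plain row-major layouts.
int npu_matmul_int8(const int8_t *A, const int8_t *B, int32_t *C,
                    int M, int K, int N)
{
    rknn_matmul_ctx     ctx;
    rknn_matmul_info    info;
    rknn_matmul_io_attr io_attr;

    memset(&info, 0, sizeof(info));
    info.M    = M;
    info.K    = K;
    info.N    = N;
    info.type = RKNN_TENSOR_INT8;   // int8 x int8 -> int32 accumulate

    if (rknn_matmul_create(&ctx, &info, &io_attr) != 0) {
        return -1;
    }

    // NPU-visible buffers, sized by the attributes the API reports back.
    rknn_tensor_mem *a = rknn_create_mem(ctx, io_attr.A.size);
    rknn_tensor_mem *b = rknn_create_mem(ctx, io_attr.B.size);
    rknn_tensor_mem *c = rknn_create_mem(ctx, io_attr.C.size);

    // Assumes the default layout where A and B are plain row-major; the
    // native/performance layouts need a repack step instead of a memcpy.
    memcpy(a->virt_addr, A, (size_t)M * K * sizeof(int8_t));
    memcpy(b->virt_addr, B, (size_t)K * N * sizeof(int8_t));

    rknn_matmul_set_io_mem(ctx, a, &io_attr.A);
    rknn_matmul_set_io_mem(ctx, b, &io_attr.B);
    rknn_matmul_set_io_mem(ctx, c, &io_attr.C);

    int ret = rknn_matmul_run(ctx);
    if (ret == 0) {
        memcpy(C, c->virt_addr, (size_t)M * N * sizeof(int32_t));
    }

    rknn_destroy_mem(ctx, a);
    rknn_destroy_mem(ctx, b);
    rknn_destroy_mem(ctx, c);
    rknn_matmul_destroy(ctx);
    return ret;
}
```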

I still think you guys are confusing things, as it’s a 3-core, 2 TOPS Int8 NPU where the ratings are maximums achieved while using the small reserved memory area to hold the weights.
Hence why all the RKNPU2 model zoo examples are relatively small models.
What the ratings are when marshalling data in and out of the NPU’s addressable memory area I don’t know, but likely not what you’re expecting.
The GPU is pretty mighty as well: with ArmNN and the TensorFlow Lite delegate set to GPU, it seems to reach approximately 75% of the NEON-optimised CPU’s performance.