Rock 5B NPU low-level access for amazing AI?

With models like whisper.cpp and llama.cpp running on CPU with amazing results (voice recognition and a ChatGPT-like model, both running slowly on a Raspberry Pi 4), the obvious next step would be to accelerate the rate-limiting multiply-accumulate (MAC) calculations on the NPU.

Both models are already int8-quantised, so they are a perfect fit for the NPU. To make use of it, though, the NPU would have to be usable as a pure MAC accelerator; as I understand it, it can do over 1000 multiply-accumulates per clock cycle.
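For reference, the operation that dominates both models is just a quantised matrix multiply: rows of int8 weights dot-producted against the activations, with the products accumulated in 32-bit integers. A minimal sketch of that inner loop in plain C (simplified, since the real ggml kernels work on blocks with per-block scale factors) looks like this:

```c
#include <stdint.h>

/* Reference int8 matmul: C[M][N] = A[M][K] * B[K][N], accumulating in int32.
 * Simplified sketch: the per-block scale factors used by ggml's quantised
 * formats are omitted; the point is only to show the MAC workload an NPU
 * would need to absorb. */
void mul_mat_i8(const int8_t *A, const int8_t *B, int32_t *C,
                int M, int N, int K) {
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++) {
                acc += (int32_t)A[m * K + k] * (int32_t)B[k * N + n]; /* one MAC */
            }
            C[m * N + n] = acc;
        }
    }
}
```

For a 7B-parameter model, generating a single token takes on the order of 7 billion of those MACs, which is why a unit that can retire around a thousand of them per cycle is so attractive.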

As an example, here’s news of LLaMA running on a smartphone, generating about a word per second on CPU alone:

The models are too big to just convert using RKNN (>7 GB), but I think if Radxa could help with the implementation, it would be a huge deal!

Here’s a description of where NPU support could be implemented, from the whisper.cpp roadmap FAQ:

One of the main goals of this implementation is to be very minimalistic and be able to run it on a large spectrum of hardware. The existing CPU-only implementation achieves this goal - it is bloat-free and very simple. I think it also has some educational value. Of course, not taking advantage of modern GPU hardware is a huge drawback in terms of performance. However, adding a dependency on a certain GPU framework will tie the project with the corresponding hardware and will introduce some extra complexity. With that said, adding GPU support to the project is low priority.
In any case, it would not be too difficult to add initial support. The main thing that needs to be offloaded to the GPU is the GGML_OP_MUL_MAT operator:

[whisper.cpp/ggml.c](https://github.com/ggerganov/whisper.cpp/blob/c71363f14cb0d31efe5c74f148b268da649050d9/ggml.c#L6231-L6234)

Lines 6231 to 6234 in [c71363f](https://github.com/ggerganov/whisper.cpp/commit/c71363f14cb0d31efe5c74f148b268da649050d9)

```c
        case GGML_OP_MUL_MAT:
            {
                ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
            } break;
```

This is where more than 90% of the computation time is currently spent. Also, I don’t think it’s necessary to offload the entire model to the GPU. For example, the 2 convolution layers at the start of the Encoder can easily remain on the CPU as they are not very computationally heavy. Not uploading the full model to VRAM will make it require less memory and thus make it compatible with more video cards.
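To make that concrete, the hook the FAQ is pointing at could look something like the sketch below. npu_mul_mat_supported() and npu_mul_mat() are purely hypothetical placeholders for whatever an RKNPU (or OpenCL) backend would expose; the fallback branch is the existing CPU path from the snippet above:

```c
        /* Hypothetical sketch only: npu_mul_mat_supported() and npu_mul_mat()
         * do not exist anywhere; they stand in for an accelerator backend. */
        case GGML_OP_MUL_MAT:
            {
                if (npu_mul_mat_supported(tensor->src0, tensor->src1)) {
                    /* offload the heavy matrix multiply to the NPU/GPU */
                    npu_mul_mat(params, tensor->src0, tensor->src1, tensor);
                } else {
                    /* unchanged CPU fallback */
                    ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
                }
            } break;
```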

Good luck, it can’t even run yolov8s!

In the release notes for the 1.4.0 API they mention all these low-level operators like matrix multiply, etc., but none of them are exposed at the user level. The NPU is geared toward semantic segmentation and object recognition using well-established pretrained models, and THAT’S IT! Access to the NPU runtime seems limited to "load model", "init NPU", "query model", "run inference", "get output". The end.
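For what it's worth, that entire exposed surface boils down to something like the C sketch below. It follows my reading of rknn_api.h from the RKNPU2 1.4.0 SDK, so treat the exact signatures and struct fields as assumptions to verify against the header; the model path and tensor size are made up, and all error checking is omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "rknn_api.h"                              /* RKNPU2 runtime header */

int main(void) {
    /* 1. "load model": read a pre-converted .rknn blob from disk */
    FILE *f = fopen("model.rknn", "rb");           /* hypothetical model file */
    fseek(f, 0, SEEK_END);
    long model_size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *model = malloc(model_size);
    fread(model, 1, model_size, f);
    fclose(f);

    /* 2. "init NPU" */
    rknn_context ctx;
    rknn_init(&ctx, model, (uint32_t)model_size, 0, NULL);

    /* 3. "query model": e.g. how many inputs and outputs it has */
    rknn_input_output_num io_num;
    rknn_query(ctx, RKNN_QUERY_IN_OUT_NUM, &io_num, sizeof(io_num));
    printf("inputs: %u, outputs: %u\n", io_num.n_input, io_num.n_output);

    /* 4. "run inference": set one input tensor and run the whole graph */
    static int8_t in_data[224 * 224 * 3];          /* made-up input size */
    rknn_input in = {0};
    in.index = 0;
    in.buf   = in_data;
    in.size  = sizeof(in_data);
    in.type  = RKNN_TENSOR_INT8;
    in.fmt   = RKNN_TENSOR_NHWC;
    rknn_inputs_set(ctx, 1, &in);
    rknn_run(ctx, NULL);

    /* 5. "get output". The end. */
    rknn_output out = {0};
    out.want_float = 1;
    rknn_outputs_get(ctx, 1, &out, NULL);
    printf("first output value: %f\n", ((float *)out.buf)[0]);

    rknn_outputs_release(ctx, 1, &out);
    rknn_destroy(ctx);
    free(model);
    return 0;
}
```

Note that nothing in that flow takes two arbitrary tensors and returns their product; the only unit of work the runtime accepts is a whole pre-converted .rknn graph.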

If you want to achieve this, getting OpenCL running on their GPU is probably your best bet, unless they open up the NPU's low-level runtime.
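To give an idea of what that would mean in practice, the device side of an OpenCL offload for GGML_OP_MUL_MAT could be as simple as the naive kernel below, assuming an OpenCL driver for the RK3588's Mali-G610 is available. The host would compile it with clBuildProgram and launch it with clEnqueueNDRangeKernel over an M x N global work size:

```c
/* matmul.cl: naive OpenCL C kernel computing C[M][N] = A[M][K] * B[K][N].
 * Deliberately unoptimised (no tiling, no local memory); it only shows the
 * shape of the work ggml_compute_forward_mul_mat would hand off. */
__kernel void mul_mat_f32(__global const float *A,
                          __global const float *B,
                          __global float       *C,
                          const int M, const int N, const int K) {
    const int m = get_global_id(0);   /* row of C    */
    const int n = get_global_id(1);   /* column of C */
    if (m >= M || n >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
        acc += A[m * K + k] * B[k * N + n];
    }
    C[m * N + n] = acc;
}
```

In reality the ggml side would also have to dequantise its int8 blocks (or do the dot products in integer arithmetic), and a naive loop like this would need tiling and vectorisation to get anywhere near the Mali's peak throughput, but it shows there is at least a programmable path on this board today.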

I think we should be asking Rockchip, as they're the ones responsible for the SDKs.

We should ask Radxa to ask Rockchip!

We are all buying Radxa products for the performance and the NPU. They would have a better chance of getting documentation and an API for the NPU. It would add a huge amount of value to be able to run language models on a Rock 5B.

Well, theoretically you CAN, but the approach won't be to offload a matrix multiply of two large tensors. You would have to take your whole model, convert it to ONNX, then use their janky conversion tool to convert it to their proprietary RKNN format, and use the NPU API to submit it for inference. Their examples don't cover anything but convolutional nets for semantic segmentation and object recognition, using various models like YOLO, Inception-SSD, and MobileNet. So in theory, yes, a conv net is just a deep net with convolutional layers, and the lack of examples for anything else may just be an oversight.
You mentioned the model was too big, but apparently there are smaller versions; here they mention a 'tiny' version:
Convert to ONNX · openai/whisper · Discussion #134 (github.com)
As for lama, that doesn't even seem to export to ONNX:
Export to Onnx · Issue #84 · advimman/lama (github.com)

Ooh, not that model; I had something more ambitious in mind, basically ChatGPT on the Rock 5B!

Yes, it already runs on the Rock 5B (the 7B-parameter model at least). What's interesting, though, is that it has been fine-tuned and re-released as the 7B-parameter Alpaca model, with benchmarks similar to InstructGPT!

ChatGPT and GPT-3 are reportedly matched by the leaked LLaMA large language model, and the code has been ported to C++:

It runs at a word per second or so on a Raspberry Pi 4. If it could run on the NPU, you could have something like ChatGPT running on the Rock 5B. Basically, 90% of the computation in the C++ code lives in that one matmul call. If that could be offloaded to the NPU, we could get a 10x boost in performance and a usable offline digital assistant.

There are about 100 more things I would do with this frustrating board if I had that kind of low-level NPU access :grimacing:

That's pretty exciting from a robotics perspective. Rudimentary multimodality through speech recognition, image recognition, and an LLM would really up the personal robotics game.