I have recently created go-rknnlite, a set of Go language bindings for the rknn-toolkit2 C API. These bindings let you use the Go programming language to perform inference on the RK3588 NPU.
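As a quick illustration, a minimal classification run looks something like the sketch below. The exact identifiers (`NewRuntime`, `NPUCoreAuto`, the gocv-based `Inference` call) are assumptions based on the repository's examples, and the model file and input image are hypothetical, so check the repo for the current signatures.

```go
package main

import (
	"log"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// Load the compiled .rknn model onto the NPU. NPUCoreAuto (assumed
	// name) lets the driver schedule the work on an idle core.
	rt, err := rknnlite.NewRuntime("efficientnet-lite0.rknn", rknnlite.NPUCoreAuto)
	if err != nil {
		log.Fatal(err)
	}
	defer rt.Close()

	// Preprocess the input into the model's expected format, e.g. a
	// 224x224 RGB image (hypothetical input.jpg).
	img := gocv.IMRead("input.jpg", gocv.IMReadColor)
	defer img.Close()
	gocv.CvtColor(img, &img, gocv.ColorBGRToRGB)

	// Run inference; outputs holds the raw output tensors for
	// post-processing (softmax, top-k, etc.).
	outputs, err := rt.Inference([]gocv.Mat{img})
	if err != nil {
		log.Fatal(err)
	}
	_ = outputs
}
```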
go-rknnlite also features a Pooled Runtime mode that runs multiple RKNN instances of the same model across all three NPU cores. With our EfficientNet-Lite0 model, average inference time is 7.9ms per image on a single NPU core, but running a pool of 9 runtimes across all three cores brings that down to 1.65ms per image.
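The pooled mode follows a borrow/return pattern: each goroutine takes a runtime from the pool, runs inference, and hands it back. The sketch below assumes a `NewPool` constructor with `Get`/`Return`/`Close` methods, which is how the repository's pool example reads at the time of writing; treat the exact signatures as illustrative.

```go
package main

import (
	"log"
	"sync"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// Create a pool of 9 runtimes of the same model, which the library
	// spreads across the RK3588's three NPU cores (3 per core).
	// NewPool's signature is assumed from the repo's pool example.
	pool, err := rknnlite.NewPool(9, "efficientnet-lite0.rknn")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	files := []string{"a.jpg", "b.jpg", "c.jpg"} // hypothetical inputs

	var wg sync.WaitGroup
	for _, file := range files {
		wg.Add(1)
		go func(file string) {
			defer wg.Done()

			img := gocv.IMRead(file, gocv.IMReadColor)
			defer img.Close()

			// Borrow a runtime, run inference, return it to the pool.
			rt := pool.Get()
			defer pool.Return(rt)

			if _, err := rt.Inference([]gocv.Mat{img}); err != nil {
				log.Println(err)
			}
		}(file)
	}
	wg.Wait()
}
```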
For price/performance, the RK3588 NPU is the best option around for Edge AI applications. The table below compares benchmarks of our model running on different platforms.
Device | First Inference | Second Inference |
---|---|---|
Jetson Orin Nano 8GB - CUDA | 3-4 sec | 14-18ms |
Jetson Orin Nano 8GB - CPU | N/A | 30ms |
Raspberry Pi 4B | 150ms | 92ms |
Raspberry Pi 5 | 67ms | 50ms |
Khadas VIM3 Pro | 106ms | 78ms |
Rock Pi 5B - CPU | 65-70ms | 44ms |
Rock Pi 5B - NPU (Single Core) | 12ms | 6-7ms |
Rock Pi 5B - NPU (3 Cores, 9 Threads) | N/A | 1.65ms |
Raspberry Pi CM4 with Hailo-8 (Blocking API) | 11ms | 4.2ms |
Raspberry Pi CM4 with Hailo-8 (Streaming API) | N/A | 1.2ms |
Threadripper Workstation - USB3 Coral | 9-11ms | |
Raspberry Pi CM4 - USB2 Coral | 20-27ms | |
Raspberry Pi 5 - USB2 Coral | 20-24ms | |
Raspberry Pi 5 - USB3 Coral | 9-12ms | |
Raspberry Pi 4B - USB2 Coral | 20-27ms | |
Raspberry Pi 4B - USB3 Coral | 11-18ms | |