Go-rknnlite: Go language bindings for RKNN Toolkit2

I have recently created go-rknnlite, a set of Go language bindings for the rknn-toolkit2 C API. These allow you to use the Go programming language to perform inference on the RK3588 NPU.
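
To give a feel for the bindings, here is a minimal single-runtime sketch, assuming gocv for image handling. The filenames are placeholders, and the call names (NewRuntime, NPUCoreAuto, Inference) are paraphrased from my reading of the repo's examples rather than a definitive API reference, so check the README for current signatures:

```go
package main

import (
	"image"
	"log"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// load a compiled .rknn model; core selection is left to the runtime
	rt, err := rknnlite.NewRuntime("efficientnet-lite0.rknn", rknnlite.NPUCoreAuto)
	if err != nil {
		log.Fatal(err)
	}
	defer rt.Close()

	// read the input image and resize it to the model's 224x224 input shape
	img := gocv.IMRead("cat.jpg", gocv.IMReadColor)
	defer img.Close()
	gocv.Resize(img, &img, image.Pt(224, 224), 0, 0, gocv.InterpolationArea)

	// run inference on the NPU; outputs hold the raw result tensors
	outputs, err := rt.Inference([]gocv.Mat{img})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("inference outputs:", outputs)
}
```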

It features a Pooled Runtime mode where you can run multiple RKNN instances of the same model across all three NPU cores. With our EfficientNet-Lite0 model, average inference time is 7.9ms per image on a single NPU core, but running a pool of 9 runtimes across all three cores brings it down to 1.65ms per image.
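
A sketch of how the pooled mode might be driven from goroutines, assuming the pool exposes borrow/return semantics; NewPool, Get, and Return are illustrative names from my reading of the repo, not confirmed signatures:

```go
package main

import (
	"log"
	"sync"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// a pool of 9 runtimes loading the same model, spread over the 3 NPU cores
	pool, err := rknnlite.NewPool(9, "efficientnet-lite0.rknn")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// placeholder input images
	images := []gocv.Mat{
		gocv.IMRead("img1.jpg", gocv.IMReadColor),
		gocv.IMRead("img2.jpg", gocv.IMReadColor),
	}

	var wg sync.WaitGroup
	for _, img := range images {
		wg.Add(1)
		go func(m gocv.Mat) {
			defer wg.Done()
			rt := pool.Get() // blocks until one of the 9 runtimes is free
			defer pool.Return(rt)
			if _, err := rt.Inference([]gocv.Mat{m}); err != nil {
				log.Println("inference failed:", err)
			}
		}(img)
	}
	wg.Wait()
}
```

Because Get blocks while all runtimes are busy, the pool size itself caps concurrency without any extra bookkeeping.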

For price/performance the RK3588 NPU is the best option around for Edge AI applications. The following table compares benchmarks of our model running on different platforms.

| Device | First Inference | Second Inference |
| --- | --- | --- |
| Jetson Orin Nano 8GB - CUDA | 3-4 sec | 14-18ms |
| Jetson Orin Nano 8GB - CPU | N/A | 30ms |
| Raspberry Pi 4B | 150ms | 92ms |
| Raspberry Pi 5 | 67ms | 50ms |
| Khadas VIM3 Pro | 106ms | 78ms |
| Rock Pi 5B - CPU | 65-70ms | 44ms |
| Rock Pi 5B - NPU (Single Core) | 12ms | 6-7ms |
| Rock Pi 5B - NPU (3 Cores, 9 Threads) | N/A | 1.65ms |
| Raspberry Pi CM4 with Hailo-8 (Blocking API) | 11ms | 4.2ms |
| Raspberry Pi CM4 with Hailo-8 (Streaming API) | N/A | 1.2ms |
| Threadripper Workstation - USB3 Coral | | 9-11ms |
| Raspberry Pi CM4 - USB2 Coral | | 20-27ms |
| Raspberry Pi 5 - USB2 Coral | | 20-24ms |
| Raspberry Pi 5 - USB3 Coral | | 9-12ms |
| Raspberry Pi 4B - USB2 Coral | | 20-27ms |
| Raspberry Pi 4B - USB3 Coral | | 11-18ms |

Hi, @3djelly

This is nice. I think the ROCK 5C Lite and CM5 Lite would make the price/performance even better. We want to send you a ROCK 5C Lite so you can compare the performance difference. We'll send you a PM.

Thanks for sending the Rock 5C Lite. I received it today and ran our benchmark to compare it with the Rock 5B, using pooled runtimes to spread inference across all NPU cores.

| Number of Runtimes | Execution Time (5B) | Execution Time (5C Lite) | Avg Inference Per Image (5B) | Avg Inference Per Image (5C Lite) |
| --- | --- | --- | --- | --- |
| 1 | 59.97s | 66.91s | 7.91ms | 8.83ms |
| 2 | 34.56s | 33.76s | 4.55ms | 4.45ms |
| 3 | 22.94s | 23.75s | 3.02ms | 3.13ms |
| 6 | 13.89s | 18.16s | 1.83ms | 2.40ms |
| 9 | 12.54s | 17.37s | 1.65ms | 2.29ms |
| 12 | 11.97s | 16.69s | 1.57ms | 2.20ms |
| 15 | 12.03s | 16.63s | 1.58ms | 2.19ms |

At the optimal pool size of 9 runtimes, the Rock 5B averages 1.65ms inference per image and the 5C Lite 2.29ms.
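
For context on how the two measurements relate, the per-image average is just the total execution time divided by the number of images processed (the figures above imply roughly 7,600 images per run). A minimal timing harness for this sort of benchmark could look like the following, where runOneImage is a hypothetical stand-in for running one inference to completion:

```go
package main

import (
	"fmt"
	"time"
)

// benchmark runs numImages inferences and reports the total wall time and
// the per-image average, mirroring the two columns in the table above.
func benchmark(runOneImage func(), numImages int) {
	start := time.Now()
	for i := 0; i < numImages; i++ {
		runOneImage()
	}
	total := time.Since(start)
	fmt.Printf("total: %v, avg per image: %v\n",
		total, total/time.Duration(numImages))
}

func main() {
	// simulate a 1.65ms inference to show the arithmetic
	benchmark(func() { time.Sleep(1650 * time.Microsecond) }, 100)
}
```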

The price/performance is very good; I just need to wait for Arace to ship my CM5 order so I can build a carrier board and put it to use in a prototype product I'm developing.

I have added support and examples for Object Detection using YOLOv5 and YOLOv8.
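
As a rough sketch of what the detection flow looks like, assuming the repo's postprocess package; the NewYOLOv5, YOLOv5COCOParams, and DetectObjects names are paraphrased from my reading of the examples and may not match the current API exactly:

```go
package main

import (
	"fmt"
	"log"

	"github.com/swdee/go-rknnlite"
	"github.com/swdee/go-rknnlite/postprocess"
	"gocv.io/x/gocv"
)

func main() {
	rt, err := rknnlite.NewRuntime("yolov5s.rknn", rknnlite.NPUCoreAuto)
	if err != nil {
		log.Fatal(err)
	}
	defer rt.Close()

	// post-processor configured with the standard COCO parameters
	yolo := postprocess.NewYOLOv5(postprocess.YOLOv5COCOParams())

	img := gocv.IMRead("traffic.jpg", gocv.IMReadColor)
	defer img.Close()

	outputs, err := rt.Inference([]gocv.Mat{img})
	if err != nil {
		log.Fatal(err)
	}

	// decode the raw output tensors into boxes, classes, and scores
	detections := yolo.DetectObjects(outputs)
	fmt.Println("detections:", detections)
}
```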

My Radxa CM5 order turned up today, and it gets even faster inference times than the Rock 5B.

Using 3 NPU cores and 9 models in a pool, the 5B (RK3588) has an average image inference time of 1.65ms; on the CM5 (RK3588S2) it is 1.21ms.

This makes it as fast as the streaming API on the Hailo-8 M.2 card, which costs around $200 USD.