I have recently created go-rknnlite, a set of Go language bindings for the rknn-toolkit2 C API. These bindings let you use the Go programming language to perform inference on the RK3588 NPU.
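As a quick illustration, a minimal classification run looks something like the sketch below. The exact identifiers (`NewRuntime`, `NPUCoreAuto`, the gocv-based `Inference` call) are assumptions based on the repository's examples, and the model file and input image are hypothetical, so check the repo for the current signatures.

```go
package main

import (
	"log"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// Load the compiled .rknn model onto the NPU. NPUCoreAuto (assumed
	// name) lets the driver schedule the work on an idle core.
	rt, err := rknnlite.NewRuntime("efficientnet-lite0.rknn", rknnlite.NPUCoreAuto)
	if err != nil {
		log.Fatal(err)
	}
	defer rt.Close()

	// Preprocess the input into the model's expected format, e.g. a
	// 224x224 RGB image (hypothetical input.jpg).
	img := gocv.IMRead("input.jpg", gocv.IMReadColor)
	defer img.Close()
	gocv.CvtColor(img, &img, gocv.ColorBGRToRGB)

	// Run inference; outputs holds the raw output tensors for
	// post-processing (softmax, top-k, etc.).
	outputs, err := rt.Inference([]gocv.Mat{img})
	if err != nil {
		log.Fatal(err)
	}
	_ = outputs
}
```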
go-rknnlite also features a Pooled Runtime mode that runs multiple RKNN instances of the same model across all three NPU cores. With our EfficientNet-Lite0 model, average inference time is 7.9ms per image on a single NPU core, but running a pool of 9 runtimes across all three cores brings that down to 1.65ms per image.
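The pooled mode follows a borrow/return pattern: each goroutine takes a runtime from the pool, runs inference, and hands it back. The sketch below assumes a `NewPool` constructor with `Get`/`Return`/`Close` methods, which is how the repository's pool example reads at the time of writing; treat the exact signatures as illustrative.

```go
package main

import (
	"log"
	"sync"

	"github.com/swdee/go-rknnlite"
	"gocv.io/x/gocv"
)

func main() {
	// Create a pool of 9 runtimes of the same model, which the library
	// spreads across the RK3588's three NPU cores (3 per core).
	// NewPool's signature is assumed from the repo's pool example.
	pool, err := rknnlite.NewPool(9, "efficientnet-lite0.rknn")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	files := []string{"a.jpg", "b.jpg", "c.jpg"} // hypothetical inputs

	var wg sync.WaitGroup
	for _, file := range files {
		wg.Add(1)
		go func(file string) {
			defer wg.Done()

			img := gocv.IMRead(file, gocv.IMReadColor)
			defer img.Close()

			// Borrow a runtime, run inference, return it to the pool.
			rt := pool.Get()
			defer pool.Return(rt)

			if _, err := rt.Inference([]gocv.Mat{img}); err != nil {
				log.Println(err)
			}
		}(file)
	}
	wg.Wait()
}
```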
For price/performance, the RK3588 NPU is the best option around for Edge AI applications. The table below compares benchmarks of our model running on different platforms.
Device | First Inference | Second Inference |
---|---|---|
Jetson Orin Nano 8GB - CUDA | 3-4 sec | 14-18ms |
Jetson Orin Nano 8GB - CPU | N/A | 30ms |
Raspberry Pi 4B | 150ms | 92ms |
Raspberry Pi 5 | 67ms | 50ms |
Khadas VIM3 Pro | 106ms | 78ms |
Rock Pi 5B - CPU | 65-70ms | 44ms |
Rock Pi 5B - NPU (Single Core) | 12ms | 6-7ms |
Rock Pi 5B - NPU (3 Cores, 9 Threads) | N/A | 1.65ms |
Raspberry Pi CM4 with Hailo-8 (Blocking API) | 11ms | 4.2ms |
Raspberry Pi CM4 with Hailo-8 (Streaming API) | N/A | 1.2ms |
Threadripper Workstation - USB3 Coral | 9-11ms | |
Raspberry Pi CM4 - USB2 Coral | 20-27ms | |
Raspberry Pi 5 - USB2 Coral | 20-24ms | |
Raspberry Pi 5 - USB3 Coral | 9-12ms | |
Raspberry Pi 4B - USB2 Coral | 20-27ms | |
Raspberry Pi 4B - USB3 Coral | 11-18ms | |