IMX219 + NPU real-time object detection on Zero 3W (experimental)

I’m testing the Zero 3W with the IMX219 and the NPU using FFmpeg and SDL2.
In this experiment, I used ff-rknn with pure FFmpeg to capture frames from the IMX219 and tried to detect objects in real time using YOLOv5s.

While I can grab frames using ffplay and display them at 30 FPS, I had some latency on inference, resulting in a very low FPS.

This is my first attempt to use the NPU on this board, so I need to figure out what is wrong with my setup.
Good news: it worked. Bad news: it was really slow.

I checked the NPU health during the experiment, and it looks like this (1920x1080):

root@rzero-3w:/home/rock# cat /sys/kernel/debug/rknpu/freq
198000000
root@rzero-3w:/home/rock# cat /sys/kernel/debug/rknpu/load
NPU load: 63%
root@rzero-3w:/home/rock# cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.8.8
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone1/temp
70000
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
71666
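
For anyone reading the raw numbers above: the thermal zone files report millidegrees Celsius and the frequency file reports Hz, so 71666 is about 71.7 C and 198000000 is 198 MHz. Here is a minimal Python sketch for converting these values. The helper names are made up for illustration and are not part of any Rockchip tool.

```python
from pathlib import Path


def read_sysfs(path):
    """Return the stripped text content of a sysfs-style file."""
    return Path(path).read_text().strip()


def millideg_to_celsius(raw):
    """thermal_zone*/temp is in millidegrees Celsius, e.g. '71666' -> 71.666."""
    return int(raw) / 1000.0


def hz_to_mhz(raw):
    """Frequency files report Hz, e.g. '198000000' -> 198.0."""
    return int(raw) / 1_000_000


def parse_npu_load(raw):
    """Extract the percentage from a line such as 'NPU load: 63%'."""
    return int(raw.split(":")[1].strip().rstrip("%"))
```

On the board this could be used as `millideg_to_celsius(read_sysfs("/sys/devices/virtual/thermal/thermal_zone0/temp"))`; reading the debugfs files under /sys/kernel/debug usually needs root.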

With a frame size of 640x480 I get 5.1 FPS.
With a frame size of 1920x1080:

./ff-rknn-v4l2 -f v4l2 -p nv12 -s 1920x1080 -i /dev/video0 -x 1920 -y 1080 -b 28 -a 40 -m ./model/RK356X/yolov5s-640-640.rknn
Model: ./model/RK356X/yolov5s-640-640.rknn - size: 8073664.
sdk version: 1.4.0 (a10f100eb@2022-09-09T09:07:14) driver version: 0.8.8
model input num: 1, output num: 3
model: 640x640x3
[video4linux2,v4l2 @ 0xaaaaf8b5ea60] ioctl(VIDIOC_G_INPUT): Inappropriate ioctl for device
[video4linux2,v4l2 @ 0xaaaaf8b5ea60] ioctl(VIDIOC_G_PARM): Inappropriate ioctl for device
[video4linux2,v4l2 @ 0xaaaaf8b5ea60] Time per frame unknown
[video4linux2,v4l2 @ 0xaaaaf8b5ea60] Stream #0: not enough frames to estimate rate; consider increasing probesize
INFO: SDL: compiled with=2.30.0 linked against=2.30.0
arm_release_ver of this libmali is 'g2p0-01eac0', rk_so_ver is '10'.
rga_api version 1.3.2_[0] (RGA is compiling with meson base: $PRODUCT_BASE)
loadLabelName ./model/coco_80_labels_list.txt
^[INFO: Program quit after 27939 ticks
Avg FPS: 3.6

This was not running with the “performance” governor, to avoid throttling or temperature issues…
Maybe someone got better results, which would mean something is wrong with my setup.

Note 1:
I was able to run the NPU (rknpu) at 396000000 Hz (396 MHz), which gave 5 FPS (1920x1080).

Note 2:
Running the “performance” governor,
I get 5.5 ~ 6 FPS (1920x1080).

Note 3:
No improvement yet.

cat /sys/class/devfreq/fde40000.npu/cur_freq 
900000000
cat /sys/class/devfreq/fde60000.gpu/cur_freq 
400000000

Changed to OpenCV and got better results… 9.5 ~ 10 FPS, no latency.

I dropped OpenCV and FFmpeg, optimized the code a bit (still single-threaded), moved to SDK 1.5, and got some improvement: 15 FPS (1920x1080). I still need to draw some widgets on screen, and I don’t think I can get better results than that with a single thread.

./rknn-v4l2 -f v4l2 -p NV12 -s 1920x1080 -i /dev/video0 -x 1920 -y 1080 -b 28 -a 40 -m ./model/RK356X/yolov5s-640-640.rknn
Model: ./model/RK356X/yolov5s-640-640.rknn - size: 7624064.
sdk version: 1.5.0 (e6fe0c678@2023-05-25T08:09:20) driver version: 0.8.8
model input num: 1, output num: 3
model: 640x640x3
INFO: SDL: compiled with=2.30.0 linked against=2.30.0
arm_release_ver of this libmali is 'g2p0-01eac0', rk_so_ver is '10'.
rga_api version 1.3.2_[0] (RGA is compiling with meson base: $PRODUCT_BASE)
loadLabelName ./model/coco_80_labels_list.txt
INFO: Program quit after 13539 ticks
INFO: Stop sensor device
INFO: Close sensor device
INFO: Free resize_buf: 0xffff8973f010
INFO: Destroy renderer: 0xaaaad6fb3e70
INFO: Destroy window: 0xaaaad6fbc280
Free rknn ctx: 187650726442880)
Free model data: 0xffff95792010
Avg FPS: 15.0

Monitoring:

top - 20:05:49 up  7:20,  0 users,  load average: 1.07, 0.67, 0.32
Tasks: 157 total,   1 running, 156 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.6 us,  2.1 sy,  0.0 ni, 93.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   1977.7 total,   1337.8 free,    206.0 used,    434.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1590.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
  13311 rock       1 -19  364668 100392  72424 D  17.5   5.0   0:41.35 rknn-v4+ 
    289 root      20   0 1081888   7592   3240 S  12.9   0.4   3:16.71 rkaiq_3+ 
     11 root      20   0       0      0      0 I   0.3   0.0   0:13.48 rcu_sch+ 
    190 root     -51   0       0      0      0 S   0.3   0.0   0:07.89 irq/30-+ 
  12165 root       0 -20       0      0      0 I   0.3   0.0   0:00.78 kworker+ 
  12935 root      20   0       0      0      0 I   0.3   0.0   0:01.01 kworker+ 
  13197 root      20   0       0      0      0 I   0.3   0.0   0:00.26 kworker+ 
  13310 root      20   0    7392   2724   2136 R   0.3   0.1   0:01.68 top      
  13321 root      20   0       0      0      0 I   0.3   0.0   0:00.32 kworker+ 
      1 root      20   0  165964   7780   5032 S   0.0   0.4   0:05.66 systemd  
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.11 kthreadd 
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp   
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par+ 
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_perc+ 
      9 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tas+ 
     10 root      20   0       0      0      0 S   0.0   0.0   0:01.52 ksoftir+ 
     12 root      rt   0       0      0      0 S   0.0   0.0   0:00.32 migrati+

CPU Temp:

root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72777

Opengles2:

Hello, Avaf. 15FPS is a great result! Would you mind sharing your code?

These experiments are sponsored for a possible project, and I have permission to share the results at this time. If the project doesn’t go ahead and I get the green light, I will share more code than has been shared here: IMX415 + NPU demo on ROCK 5B
If you are interested, follow the results and comparisons that will be posted here and there.
Hey, but you can become a sponsor too. :wink:

Actually I get 19 FPS as a POC, but I need some real-world results to see if it’s viable.

Low-light condition test. The $10 camera (shipping included) is for close-ups anyway.
The drawback of such a high FPS on this tiny board is that it gets to ~85 C (long run). They sent me a big radiator, so let’s see if it fits and can reduce the temp.

See how it runs on HDMI USB touch (7 inch):
https://mega.nz/file/4GQWkIwJ#gkVQpYJ6nPlZUBnFmMyN-kE9Wy2--hkw7y2rdWcKDnw

The next step is to check if I can run a second instance and get rid of the cables.

My suggestion to the Radxa team: launch a new board with a similar layout but with the RK3568 instead of the RK3566-T, 3 USB-C ports instead of 2, and the CSI connector at one end with DSI at the other. People may complain it would get a bit bigger and would not fit into an RPi Zero W case, but I think the current layout does not fit either (I haven’t tried it yet… :smile: )

Update:

  • include RTC backup
  • micro HDMI (optional?? or without it to save space and cool down the temp)

I would be willing to pay $10 more for such a board and use it with an 8" or 10" display.

And finally, would it be too much to ask for an RK3588S on such a board?

Today I got the heatsink installed, and it fits like this:

It looks messy but it works, and I am glad no shorts occurred.

Running for 30 min… (performance, max npu and gpu freq.)

root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/type
soc-thermal
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
64444
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
65000
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
65000
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
66250
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
66875
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
66875
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
67500
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
68125
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
71111
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72222
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72222
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72222
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72777
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72777
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72222
root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
72222

It does the job, but if there’s no air flowing through the fins, I think it can get to 75 C and stabilize.

Update:

uptime 2:06 , no airflow or ventilation:

root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
75625

Update 2:

uptime 3:27, no airflow or ventilation, temp is stable now:

root@rzero-3w:/home/rock# cat /sys/devices/virtual/thermal/thermal_zone0/temp
76250


How can I get in touch with you to discuss becoming a sponsor?

Please send me a private message at @avaf.

yolov5s is way slower than some of the other models out there. You can probably get up to 2x or 2.5x framerate with yolov5n or yolov6n

See this repo and look at the benchmarks

I also did the comparisons myself and I did get a gigantic boost by using yolov5n

I stopped researching / testing inference a while ago, so it would be nice to see some benchmarks about this. It would also be nice if you could document your findings and the code you used.

I expect to get a ROCK 5C next month so I can review my code on that board with kernel 6.1 and, hopefully, a working CSI. Arace seems to take a long time to dispatch the goods, but I asked for special treatment, which could be the reason. Allnet was quick to ship and handle in the same situation.

BTW, the RK3566-T HDMI output is extremely slow, so I hope you are not comparing against my results above.

You can see code and benchmarks on the repo I listed :slight_smile: I think I had around 0.02ms inference but I was bottlenecked by CPU (I used OpenCV)

What is RK3566-T? I assume it is RK3566. I use MIPI DSI at 93 Hz

Man, you’re asking for trouble… :laughing:
Let’s just say it is. By HDMI output i mean the GPU. Slower CPU, slower GPU, and sometimes no NPU.

I mean practical benchmarks not theoretical. I did not see any improvement with yolov8 on rk3588 for example.

My inference times went down significantly when I used yolov5n, to around 0.02s as I said. I do a lot of experimentation every day, so I would have to dig to find that code. Anyway, I’ll share my results when I get the SDL3 code working

Yeah RK3566 is slower than RK3568 but not by a lot. Never heard of any RK3566-T so I don’t know if that’s something new. I am using RK3566. Custom board ( https://imgur.com/a/86vrNDY )

Are you sure about your metric?

30 FPS is 33.33 ms for each frame.
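
To make the unit check concrete, here is a small Python sketch of the arithmetic. The function names are illustrative only. It shows why a per-frame time of 0.02 ms would be implausible, while 0.02 s is in a realistic range.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def max_fps(per_frame_seconds):
    """Upper bound on frame rate if each frame costs this many seconds."""
    return 1.0 / per_frame_seconds


print(f"{frame_budget_ms(30):.2f} ms per frame at 30 FPS")   # 33.33 ms
print(f"{max_fps(0.02e-3):.0f} FPS ceiling at 0.02 ms")      # 50000 FPS
print(f"{max_fps(0.02):.0f} FPS ceiling at 0.02 s")          # 50 FPS
```

These are ceilings for the inference step alone; capture, pre-processing, post-processing and display all add to the per-frame cost.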

Not quite sure, it was many months ago. But it was around that number. In any case I was heavily CPU-bottlenecked back then and pre-processing / post-processing tanked my framerate.

I am working hard on getting the SDL code working and then I’ll share my results

EDIT: sorry I meant 0.02s not 0.02ms. That would be insane

I apparently have 28.4fps on the SDL3 code running on my RK3566

Avg FPS: 28.4

Playback looks very smooth, but I am having an issue: my camera is apparently running at 37.5fps, or at least that’s what the framerate calculation says when I delete all NPU operations from the program. This means my playback has an ugly 2 second delay, because it is not discarding any frames from the camera and is trying to keep up.

I also have some weird color issues but that may just be my very old RGA version, I don’t know

Anyway framerate is good but I would really need to slow down the camera to not have delay on the playback
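
One common alternative to slowing the camera down is to keep only the newest frame and let older ones be overwritten, so the consumer never works through a backlog. Below is a minimal Python sketch of that idea. It is not the code used in this thread, and the `LatestFrame` class is a hypothetical name introduced for illustration.

```python
import threading


class LatestFrame:
    """One-slot buffer: the producer overwrites, the consumer gets the newest."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self._taken = True   # True while there is no unread frame in the slot
        self.dropped = 0     # count of frames overwritten before being read

    def put(self, frame):
        """Called by the capture side for every frame from the camera."""
        with self._cond:
            if not self._taken:
                self.dropped += 1   # previous frame was never consumed
            self._frame = frame
            self._taken = False
            self._cond.notify()

    def get(self, timeout=None):
        """Called by the inference/display side; returns None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: not self._taken, timeout):
                return None
            self._taken = True
            return self._frame
```

With a capture thread calling `put()` at the camera rate and the inference loop calling `get()`, the displayed frame is at most one capture interval old, at the cost of skipping frames the consumer could not keep up with.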

By the way CPU usage is 7%. Awesome! When I tried to use OpenCV with this resolution I got like 10fps and 45% CPU usage, which was a bottleneck as it was single threaded

Looks good. My IMX219 setup maxes out at 30 fps (1920x1080). I know Rockchip had a patch for 48 fps, but I could not find it (they deleted it, I guess). Do you have that patch?

For reference, here are my findings regarding inference.
Inference on bus.jpg, single core, single thread:

|Board / model|YOLOv5s (ms)|YOLOv8n (ms)|
|-------------|------------|------------|
|Zero 3W      |60.970      |52.175      |
|Rock 5B      |21.875      |21.251      |

BTW, I posted how to fix the latency: you need to consume the frames buffered by FFmpeg. A 2 second delay at 30 fps is something like 60 frames, so read 65 frames and discard them before you start processing.
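
As a rough sketch of that calculation in Python: the number of frames to throw away is the delay times the frame rate, plus a small safety margin. The margin of 5 is an assumption taken from the 60 vs 65 figures here, and both function names are made up for illustration. `frames` stands in for whatever yields captured frames in your program.

```python
import math

_END = object()  # sentinel marking an exhausted frame source


def frames_to_discard(delay_s, fps, margin=5):
    """Frames buffered during `delay_s` seconds at `fps`, plus a margin."""
    return math.ceil(delay_s * fps) + margin


def skip_frames(frames, count):
    """Consume and drop up to `count` frames, then return the iterator."""
    it = iter(frames)
    for _ in range(count):
        if next(it, _END) is _END:
            break  # source ended before `count` frames were read
    return it


n = frames_to_discard(2.0, 30)
print(n)  # 65
```

After `live = skip_frames(capture, n)`, processing would start from `live` instead of from the stale buffered frames.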

I get 16-20 ms on my custom-trained yolov5n model. I don’t think it’s being miscalculated

Inference time = 20 ms
model is NHWC input fmt

So certainly look into switching to a faster model to improve framerate.

I’ll be looking into how to fix the latency as you say

By the way I had to do some weird stuff to the SDL3 code to get it to work, and it still has a couple of color issues. But I’ll fix that later