IMX219 + NPU real-time object detection on Zero 3W (experimental)

I get 16ms to 20ms inference time on my custom-trained yolov5n model. I don’t think it’s being miscalculated.

Inference time = 20 ms
model is NHWC input fmt

So certainly look into switching to a faster model to improve framerate.

I’ll be looking into how to fix the latency as you say

By the way, I had to do some weird stuff to the SDL3 code to get it to work, and it still has a couple of color issues. But I’ll fix that later.

Looks pretty good, mind sharing your model?

Looks good. My IMX219 setup maxes out at 30 fps (1920x1080). I know Rockchip had a patch for 48 fps, but I could not find it (they deleted it, I guess). Do you have that patch?

I don’t have it. But such a patch would be to the IMX219 driver, I think. You would just need to find the correct register map for 60fps and plug it into the imx219.c driver (drivers/media/i2c/imx219.c). It should be possible to find that register map somewhere. Probably on Alibaba, as crazy as it sounds: I sourced my IMX586 from Alibaba and my seller gave me a lot of classified documentation and multiple register maps for different settings.

Zero 3W, Real-time, custom model yolov5n.rknn - 80 objects (1920x1080) result:

Tasks: 157 total,   1 running, 156 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.5 us,  6.8 sy,  0.0 ni, 81.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem :   1977.7 total,   1343.1 free,    222.4 used,    412.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1568.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
   4731 rock       1 -19  510184 109092  84504 D  50.2   5.4   0:10.19 rknn-v4+ 
    291 root      20   0 1081888   8788   4748 S  13.2   0.4   2:17.71 rkaiq_3+ 
    801 rock      20   0  425580  57808  35592 S   9.2   2.9   0:55.45 weston   
   3478 root       0 -20       0      0      0 I   1.3   0.0   0:02.38 kworker+ 
    193 root     -51   0       0      0      0 S   1.0   0.0   0:03.27 irq/30-+ 
   4305 root       0 -20       0      0      0 I   1.0   0.0   0:01.27 kworker+ 
   4765 rock      20   0    7392   3200   2612 R   0.7   0.2   0:00.07 top      
     11 root      20   0       0      0      0 I   0.3   0.0   0:04.39 rcu_sch+ 
   4491 root      20   0       0      0      0 I   0.3   0.0   0:00.14 kworker+ 
   4655 root      20   0       0      0      0 D   0.3   0.0   0:00.15 kworker+ 
      1 root      20   0  166004  10168   7432 S   0.0   0.5   0:04.09 systemd  
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd 
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp   
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par+ 
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_perc+ 
      9 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tas+ 
     10 root      20   0       0      0      0 S   0.0   0.0   0:00.37 ksoftir+

On your screenshot you have 18.3 FPS frame rate and 34ms inference time.

Do you have additional post-processing going on outside of the 34ms inference time which is causing the lower FPS rate? 1000/18.3 ≈ 54.6ms per frame, which suggests to me your code spends another 20ms or so outside inference.
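To make the arithmetic above explicit, here is a minimal Python sketch. The function names are made up for illustration; the only inputs are the 18.3 FPS and 34 ms figures quoted from the screenshot.

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time spent per displayed frame, in milliseconds."""
    return 1000.0 / fps


def non_inference_ms(fps: float, inference_ms: float) -> float:
    """Time per frame that is NOT accounted for by inference."""
    return frame_budget_ms(fps) - inference_ms


total = frame_budget_ms(18.3)          # about 54.6 ms
extra = non_inference_ms(18.3, 34.0)   # about 20.6 ms
print(f"{total:.1f} ms per frame, {extra:.1f} ms outside inference")
```

The remainder covers everything else in the loop (capture, color conversion, drawing, display), so it only shows how much time is unaccounted for, not where it goes.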

Drawing TTF text and rects; I think that is expensive. This happens after inference.
Not to mention the GPU is slow, or running slow. I don’t have any rk3566 to compare to, only an rk3568, which is much, much faster.

Do you have a Zero 3W around to check if my findings below are correct?

cat /sys/class/devfreq/fde40000.npu/cur_freq 
900000000
cat /sys/class/devfreq/fde60000.gpu/cur_freq 
400000000

cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
408000 600000 816000 1104000 1416000

cat /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
1416000

I have my DT like this:

cpu0_opp_table: cpu0-opp-table {
  compatible = "operating-points-v2";
  opp-shared;

  mbist-vmin = <825000 900000 950000>;
  nvmem-cells = <&cpu_leakage>, <&core_pvtm>, <&mbist_vmin>, <&cpu_opp_info>;
  nvmem-cell-names = "leakage", "pvtm", "mbist-vmin", "opp-info";
  rockchip,max-volt = <1200000>;
  rockchip,pvtm-voltage-sel = <
   0 84000 0
   84001 87000 1
   87001 91000 2
   91001 100000 3
  >;
  rockchip,pvtm-freq = <408000>;
  rockchip,pvtm-volt = <900000>;
  rockchip,pvtm-ch = <0 5>;
  rockchip,pvtm-sample-time = <1000>;
  rockchip,pvtm-number = <10>;
  rockchip,pvtm-error = <1000>;
  rockchip,pvtm-ref-temp = <40>;
  rockchip,pvtm-temp-prop = <26 26>;
  rockchip,thermal-zone = "soc-thermal";
  rockchip,temp-hysteresis = <5000>;
  rockchip,low-temp = <0>;
  rockchip,low-temp-adjust-volt = <
   0 1992 75000
  >;

  opp-408000000 {
   opp-hz = /bits/ 64 <408000000>;
   opp-microvolt = <850000 850000 1150000>;
   opp-microvolt-L3 = <900000 900000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-600000000 {
   opp-hz = /bits/ 64 <600000000>;
   opp-microvolt = <850000 850000 1150000>;
   opp-microvolt-L3 = <900000 900000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-816000000 {
   opp-hz = /bits/ 64 <816000000>;
   opp-microvolt = <850000 850000 1150000>;
   opp-microvolt-L3 = <900000 900000 1150000>;
   clock-latency-ns = <40000>;
   opp-suspend;
  };
  opp-1104000000 {
   opp-hz = /bits/ 64 <1104000000>;
   opp-microvolt = <900000 900000 1150000>;
   opp-microvolt-L0 = <900000 900000 1150000>;
   opp-microvolt-L1 = <850000 850000 1150000>;
   opp-microvolt-L2 = <850000 850000 1150000>;
   opp-microvolt-L3 = <900000 900000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-1416000000 {
   opp-hz = /bits/ 64 <1416000000>;
   opp-microvolt = <1025000 1025000 1150000>;
   opp-microvolt-L0 = <1025000 1025000 1150000>;
   opp-microvolt-L1 = <975000 975000 1150000>;
   opp-microvolt-L2 = <950000 950000 1150000>;
   opp-microvolt-L3 = <1000000 1000000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-1608000000 {
   opp-hz = /bits/ 64 <1608000000>;
   opp-microvolt = <1100000 1100000 1150000>;
   opp-microvolt-L0 = <1100000 1100000 1150000>;
   opp-microvolt-L1 = <1050000 1050000 1150000>;
   opp-microvolt-L2 = <1025000 1025000 1150000>;
   opp-microvolt-L3 = <1000000 1000000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-1800000000 {
   opp-hz = /bits/ 64 <1800000000>;
   opp-microvolt = <1150000 1150000 1150000>;
   opp-microvolt-L0 = <1150000 1150000 1150000>;
   opp-microvolt-L1 = <1100000 1100000 1150000>;
   opp-microvolt-L2 = <1075000 1075000 1150000>;
   opp-microvolt-L3 = <1050000 1050000 1150000>;
   clock-latency-ns = <40000>;
  };
  opp-1992000000 {
   opp-hz = /bits/ 64 <1992000000>;
   opp-microvolt = <1150000 1150000 1150000>;
   opp-microvolt-L0 = <1150000 1150000 1150000>;
   opp-microvolt-L1 = <1150000 1150000 1150000>;
   opp-microvolt-L2 = <1125000 1125000 1150000>;
   opp-microvolt-L3 = <1100000 1100000 1150000>;
   clock-latency-ns = <40000>;
  };
 };

Any idea how to set 1.8 GHz?

Yeah, I forgot exactly, but there is Rockchip/Radxa kernel code that excludes the 1800000000 opp for some reason, which can be confusing. I forgot where I got that info, and whether it was Rockchip or Radxa, but it had an if statement.

But my board is running at 1.4 GHz, I think.
Someone here on the forum claimed Joshua’s Ubuntu image runs at 1.8 GHz on the 3W, but I don’t have any SD card available to try it out. And Ubuntu Desktop is bloated…

Dunno Avaf, my memory is terrible, but I did have it running at 1.8.

kernel/drivers/soc/rockchip/rockchip_opp_select.c

Thanks for the info, i will try to find how to do that.

I think you just change the opp table name, so that opp-1608000000 becomes opp-1600000000, and do the same for 1.8 GHz with some other name.
Apologies for forgetting the details, but it was something like that, as I didn’t run a custom kernel.

Do you still have your 3W around with Joshua’s image? If you do, is it possible to post the running dtb (zipped) here? Thank you!

I found the rk3566t limitation in the code, so maybe I need to recompile it.

Probably not, as they now seem to lack SD cards.
You can hack the code, but in the Discord conversation lower down I commented on 1.6 GHz and posted the dts changes. I think the OS just loads the OPPs by ordinal and doesn’t care about the opp name, so it was just a dts-to-dtb change and no kernel compile was needed.
It’s just the code that looks for specific opp names to delete.

I don’t think CPU usage is the issue. I have around 7-10% CPU usage on my end and I get 28fps-32fps on my 16ms inference model. Tomorrow I’ll try other models.

I think this would greatly benefit from multithreading, even if just to run the video display async from the RKNN thread, at least the video would always look smooth even with low inference framerate
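A rough sketch of that idea, in Python for brevity: the display loop hands the newest frame to a background worker and keeps drawing with whatever detections are already available, so it never waits on inference. Everything here is hypothetical; `fake_inference` merely sleeps in place of the real RKNN call, and the actual project would presumably do the equivalent with native threads.

```python
import threading
import time


def fake_inference(frame):
    """Hypothetical stand-in for the real NPU call; sleeps ~20 ms."""
    time.sleep(0.02)
    return f"boxes for frame {frame}"


class InferenceWorker:
    """Runs inference on a background thread, always on the newest frame."""

    def __init__(self, infer):
        self._infer = infer
        self._cond = threading.Condition()
        self._pending = None      # newest frame not yet processed
        self._result = None       # most recent detections
        self._stopping = False
        self.completed = 0        # number of finished inferences
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def submit(self, frame):
        """Offer a frame without blocking; an unprocessed older frame is dropped."""
        with self._cond:
            self._pending = frame
            self._cond.notify()

    def latest(self):
        """Return the most recent detections (None until the first is ready)."""
        with self._cond:
            return self._result

    def stop(self):
        with self._cond:
            self._stopping = True
            self._cond.notify()
        self._thread.join()

    def _loop(self):
        while True:
            with self._cond:
                while self._pending is None and not self._stopping:
                    self._cond.wait()
                if self._stopping:
                    return
                frame, self._pending = self._pending, None
            result = self._infer(frame)   # slow part runs outside the lock
            with self._cond:
                self._result = result
                self.completed += 1


def demo(num_frames=30, frame_interval=0.005):
    """Display loop: never waits for inference, draws whatever boxes exist."""
    worker = InferenceWorker(fake_inference)
    displayed = 0
    for frame in range(num_frames):
        worker.submit(frame)
        boxes = worker.latest()   # may be None or a few frames old
        displayed += 1            # a real loop would draw frame + boxes here
        time.sleep(frame_interval)
    worker.stop()
    return displayed, worker.completed, worker.latest()
```

Calling `demo()` displays all 30 frames while only a handful of inferences complete. The trade-off is that boxes can lag the video by a frame or more.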

EDIT: 50% CPU usage??? That may well be the issue on your end!

🤔 Are you using nyanmisaka’s zero-copy FFmpeg?

By the way, there is always the possibility that my 16ms inference time calculation was wrong somehow. I’ll definitely double-check that tomorrow. In any case I am satisfied with my current framerate, and multithreading the RKNN section would fix any possible issue down the road.

@stuartiannaylor
Following your suggestion I managed to get 1.8 GHz available, thanks.
Let’s see how it performs now.

cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies 
408000 600000 816000 1104000 1416000 1608000 1800000 

cat /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_cur_freq
1800000
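For anyone wanting to script this check, a small sketch that parses the `scaling_available_frequencies` text (values are in kHz) and tests whether a target frequency is offered. The helper names are illustrative only, and the two strings are copied from the outputs posted in this thread.

```python
def parse_khz_list(text: str) -> list[int]:
    """Split the whitespace-separated sysfs frequency list into integers (kHz)."""
    return [int(token) for token in text.split()]


def has_frequency(text: str, target_khz: int) -> bool:
    """True if target_khz appears among the available frequencies."""
    return target_khz in parse_khz_list(text)


# Outputs quoted earlier in the thread, before and after the dts change.
before = "408000 600000 816000 1104000 1416000"
after = "408000 600000 816000 1104000 1416000 1608000 1800000 "

print(has_frequency(before, 1800000))  # False
print(has_frequency(after, 1800000))   # True
print(max(parse_khz_list(after)) / 1e6, "GHz")  # 1.8 GHz
```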

@Avinadad_Mendez

I have 5 FFmpeg installed; I need to review the code and see what is wrong with that, but 10% CPU usage on X11 is pretty impressive. I could only achieve that with FFmpeg using DRM (no Wayland, GBM or X11).
Anyway, FFmpeg converts the buffer to a DRM_PRIME buffer, I think, whereas I am handling and rendering an RGB24 buffer. On my first post I had ~17% CPU usage.

I need to double-check whether I made some mistakes and see if I can run at 1.8 GHz without damaging the board.

The custom rknn model had a performance increase of 20%, your 16 ms may be correct. Thanks.

I got 10% with plain X but without any desktop environment. With Enlightenment I’d probably get around 30%. I’ll test more today.

Board running 6 hrs, CPU freq 1.8 GHz, at least stable.

cat /sys/class/devfreq/fde60000.gpu/cur_freq
800000000
cat /sys/class/devfreq/fde40000.npu/cur_freq
900000000
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq
1800000

Temp (on idle):

cat /sys/devices/virtual/thermal/thermal_zone0/temp
55555
cat /sys/devices/virtual/thermal/thermal_zone1/temp
54375
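The raw sysfs numbers above are easier to read once converted: thermal zone temperatures are reported in millidegrees Celsius, cpufreq values in kHz, and devfreq `cur_freq` in Hz. A minimal sketch of those conversions follows; the function names are invented for illustration, and `read_int` is a convenience for reading the same files on a board.

```python
from pathlib import Path


def read_int(path: str) -> int:
    """Read a single integer from a sysfs-style text file."""
    return int(Path(path).read_text().strip())


def millideg_to_celsius(raw: int) -> float:
    """thermal_zoneN/temp is in millidegrees Celsius."""
    return raw / 1000.0


def khz_to_ghz(raw: int) -> float:
    """cpufreq scaling_cur_freq is in kHz."""
    return raw / 1_000_000


def hz_to_mhz(raw: int) -> float:
    """devfreq cur_freq is in Hz."""
    return raw / 1_000_000


# Values quoted from the outputs above.
print(f"zone0: {millideg_to_celsius(55555):.1f} C")
print(f"zone1: {millideg_to_celsius(54375):.1f} C")
print(f"CPU:   {khz_to_ghz(1800000):.1f} GHz")
print(f"GPU:   {hz_to_mhz(800000000):.0f} MHz")
print(f"NPU:   {hz_to_mhz(900000000):.0f} MHz")
```

On the board itself, something like `millideg_to_celsius(read_int("/sys/devices/virtual/thermal/thermal_zone0/temp"))` would give the live reading.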

@Avinadad_Mendez
Sorry for the wrong info: rknn-v4l2 is not using FFmpeg, which may be the reason for the high CPU usage.