Object detection with NPU - h264 streams - h264 camera - rtsp

I optimized and enhanced the code to feed the NPU with 4 streams (H264, 1080p) for object detection, scaling each one down so that all four fit on a 1920x1080 display, with reasonable results.

The code runs under Wayland/Weston and keeps the CPU at a low frequency and a low temperature.

Object color:

  • car: green
  • person: red
  • bicycle: yellow
  • bus: pink
  • umbrella: white
  • motorcycle: blueish
  • anything else blue / blue tint
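
The color scheme above could be expressed as a small lookup table. This is only a sketch: the class names follow the list, but `box_color`, `color_for_label` and the exact RGB values are hypothetical and not taken from the released code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* RGB triple used to draw a detection box. */
typedef struct { uint8_t r, g, b; } box_color;

/* Hypothetical lookup: the class-to-color pairs follow the list above,
 * but the exact RGB values are assumptions chosen for illustration. */
box_color color_for_label(const char *label)
{
    static const struct { const char *name; box_color color; } table[] = {
        { "car",        {   0, 255,   0 } },  /* green */
        { "person",     { 255,   0,   0 } },  /* red */
        { "bicycle",    { 255, 255,   0 } },  /* yellow */
        { "bus",        { 255, 192, 203 } },  /* pink */
        { "umbrella",   { 255, 255, 255 } },  /* white */
        { "motorcycle", {   0, 128, 255 } },  /* blueish */
    };

    assert(label != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(label, table[i].name) == 0)
            return table[i].color;
    }
    return (box_color){ 0, 0, 255 };          /* anything else: blue */
}
```

Calling `color_for_label("person")` returns the red entry, and any label not in the table falls through to plain blue.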

Video teaser

What are your frames like?

Not sure what you mean. The frames are 1920x1080 at 29 fps, rendered at maybe 25 to 29 fps, I guess.

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/apps/videos_rknn/vid-3.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42mp41isomavc1
    creation_time   : 2020-12-03T00:03:56.000000Z
  Duration: 00:00:27.39, start: 0.000000, bitrate: 5963 kb/s
  Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt709), 1920x1080, 5704 kb/s, 29.97 fps, 29.97 tbr, 30k tbn, 60k tbc (default)
    Metadata:
      creation_time   : 2020-12-03T00:03:56.000000Z
      handler_name    : L-SMASH Video Handler
      vendor_id       : [0][0][0][0]
      encoder         : AVC Coding
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 253 kb/s (default)
    Metadata:
      creation_time   : 2020-12-03T00:03:56.000000Z
      handler_name    : L-SMASH Audio Handler
      vendor_id       : [0][0][0][0]

Sorry, I had shorthanded that. I meant frames per second. That’s pretty damn good.

11 streams (1080p) + realtime camera (1080p), but this needs more CPU.

Screencast (2 Mbps bitrate) recorded with ffmpeg while running the demo at the same time: (https://github.com/hbiyik/FFmpeg)

Some benchmarks (clock frequencies; the values look like MHz):

 CPU0-3  CPU4-5  CPU6-7     DDR     DSU     GPU     NPU
   1416     816     816    2112    1416     300    1000
   1800     600    1416    2112    1800     300    1000
   1200    1608     408    2112    1200     300    1000
    816    1416     408    2112     816     300    1000
   1416    1416    1416    2112    1416     300    1000
   1200     600     816    2112    1200     300    1000
    816     408     408    2112     816     300    1000
   1800    1416    1608    2112    1800     300    1000
    408    1416     408    2112     396     300    1000
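
The thread does not say how these rows were captured. One way to sample the CPU columns, as a rough sketch, is to read the standard Linux cpufreq sysfs files, which report kHz. The `policy0`/`policy4`/`policy6` numbering is an assumption about how the three clusters are exposed on this board, and `read_freq_mhz` and `print_cpu_row` are hypothetical helper names. The DDR, DSU, GPU and NPU clocks live in board-specific locations and are left out here.

```c
#include <stdio.h>

/* Read a cpufreq-style sysfs file holding a frequency in kHz and return MHz.
 * Returns -1 if the file cannot be opened or parsed. */
long read_freq_mhz(const char *path)
{
    FILE *f = fopen(path, "r");
    long khz = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz < 0 ? -1 : khz / 1000;
}

/* Print one row with the current frequency of each CPU cluster.
 * Assumption: the clusters appear as policy0, policy4 and policy6. */
void print_cpu_row(void)
{
    static const char *paths[] = {
        "/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq",
        "/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq",
        "/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq",
    };

    for (size_t i = 0; i < sizeof paths / sizeof paths[0]; i++)
        printf("%8ld", read_freq_mhz(paths[i]));
    putchar('\n');
}
```

Calling `print_cpu_row()` once per second while the demo runs would give one line per sample; a missing file shows up as -1 rather than aborting.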

Performance:

 CPU0-3  CPU4-5  CPU6-7     DDR     DSU     GPU     NPU
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000
   1800    2352    2256    2112    1800    1000    1000

NPU is the bottleneck.

 CPU0-3  CPU4-5  CPU6-7     DDR     DSU     GPU     NPU
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000
    408     408     408    2112     396    1000    1000

Realistically, the NPU can only handle 3 1080p30 video streams with decent inferencing at ~30-35 fps, and that is with a dedicated core for each stream.

Biggest issue I’m finding with video decoding/npu/gpu isn’t the cpu usage but the thermals. With active cooling the temperatures are reasonable. However with passive cooling and processing 16+ streams the temperatures can easily progress towards 80 degrees (celsius). For embedded solutions we’d prefer passive cooling to avoid undue maintenance.

I don’t have issues with thermals; maybe they would show up when running 24/7 or inside a case.

Passive cooling here:

Have you tried 16+ streams?

No. I haven’t.
I don’t have a 2K or 4K monitor here with me to display that.

Is this launched via a python script? And is any of this work open source, on GitHub, that I can quickly clone and replicate?

(If not, I understand, I don’t expect everything to be free, and realize there could be intellectual property in this work.)

It is C/C++.

I need to clean up the code and add a way to handle RTSP before releasing it. I'm not sure about the license or any possible patent issues. I will push it to GitHub when done.

A progress update on this: H264 RTSP is working, as are the H264 USB camera and H264 streams. I am cleaning up the code, which will be licensed under Apache 2.0.

For this to work you will need:

  • Desktop with HW acceleration
  • FFmpeg with HW decoding
  • SDL3 with HW acceleration for Rockchip
  • librga and kernel patched for > 4GB

Current limitations are:

  • H264 only
  • Wayland/Weston positional windows need SDL and Weston hacks. These are not provided, but you can find the patches on the Rockchip and SDL GitHub repositories.
  • RTMP support may be released soon, or you can modify the code and add it yourself; it’s straightforward.

I will try to provide all instructions later.

SDL3 for ROCKCHIP

If you don’t have SDL3 working on your board, you need to build it from sources and make sure 3D accel works.

git clone https://github.com/libsdl-org/SDL --depth=1
cd SDL
mkdir sdl3
cd sdl3
cmake .. -DSDL_TESTS=1 -DCMAKE_BUILD_TYPE=Release
make -j8
sudo cmake --install . --prefix /usr/local 
sudo ldconfig

Test it:

cd test/
./testgles2

You will see a spinning cube. When you exit, it reports the FPS, which should be around 1000 to 3000 depending on the desktop (X11 or Wayland). If you get about 300, it is software rendered, which means some prerequisites for HW acceleration were missing during the build.

Screenshot of the imx415 fisheye RTSP module and the H264 USB camera, without YOLOv5 inference on the video.

@jack , please, can you ask Rockchip or @RadxaYuntian to update libmali deb package to rk4.1 so we can test this on latest kernel?

Still? I have updated FFmpeg with HEVC and VP8 encoder support. Never mind, I thought this was encoder related.

Yeah, I think RGA outputs the buffer with padding. Without padding it scrambles the result, as I pointed out in the encoder issue on git, I guess for the same reason.

That’s the reason I would like to run with rk4.1.

OK, I don't understand those NPU things, but if you could give me a way to reproduce it, I can gdb it and fix it.

The NPU is a separate layer, but it needs an RGB888 buffer.

Here is the relevant code you can test.

Pass a DMA-BUF fd (src_fd) to be converted into a plain memory buffer (buf).

src: DRM_FORMAT_NV12_10 / DRM_FORMAT_NV15 and DRM_FORMAT_NV16 (aka YCbCr_420_SP_10B and YCbCr_422_SP)
dst: RK_FORMAT_RGB_888

Make this work and we are good to go with H265 and 10-bit.

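#include <string.h>      /* memset */
#include <rga/RgaApi.h>  /* librga C API; the header path may differ on your install */

/*
 * Convert and scale a decoded frame held in a DMA-BUF (src_fd) into a
 * CPU-visible buffer (buf) using RGA. Returns the c_RkRgaBlit() result.
 * Note: frameSize is not used yet.
 */
static int drm_rga_buf(int src_Width, int src_Height, int src_fd, int src_format, int dst_Width, int dst_Height, int dst_format, int frameSize, char *buf)
{
	rga_info_t src;
	rga_info_t dst;
	int ret;
	/* Source strides are rounded up to a multiple of 16 (1080 -> 1088). */
	int hStride = (src_Height + 15) & (~15);
	int wStride = (src_Width + 15) & (~15);
	/* Destination strides are deliberately left unpadded. */
	//int dhStride = (dst_Height + 15) & (~15);
	//int dwStride = (dst_Width + 15) & (~15);

	/* Source: the DMA-BUF fd. */
	memset(&src, 0, sizeof(rga_info_t));
	src.fd = src_fd;
	src.mmuFlag = 1;

	/* Destination: a virtual address instead of an fd. */
	memset(&dst, 0, sizeof(rga_info_t));
	dst.fd = -1;
	dst.virAddr = buf;
	dst.mmuFlag = 1;

	/* Source rect uses the padded strides; destination rect is tightly packed. */
	rga_set_rect(&src.rect, 0, 0, src_Width, src_Height, wStride, hStride, src_format);
	rga_set_rect(&dst.rect, 0, 0, dst_Width, dst_Height, dst_Width, dst_Height, dst_format);

	ret = c_RkRgaBlit(&src, &dst, NULL);
	return ret;
}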
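
To make the stride arithmetic in drm_rga_buf easier to check without a board, here is a minimal standalone sketch. `align16` and `rgb888_size` are hypothetical helper names, not part of librga. It shows why a 1080-line source ends up with a vertical stride of 1088, and how large `buf` has to be for a tightly packed RGB888 destination.

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next multiple of 16, the same expression
 * drm_rga_buf uses for the source strides. */
int align16(int v)
{
    return (v + 15) & ~15;
}

/* Bytes needed for a tightly packed RGB888 image:
 * 3 bytes per pixel and no padding at the end of each row. */
size_t rgb888_size(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * (size_t)height * 3u;
}
```

For a 1920x1080 source, `align16(1920)` stays at 1920 and `align16(1080)` becomes 1088. A 640x640 RGB888 destination needs `rgb888_size(640, 640)` = 1228800 bytes, so `buf` must be at least that large.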

Yeah OK, I know RGA is being a b**tch when converting to RGB, and currently I don't have much of an idea about that either. So basically RGA converts when its input comes from h264_rkmpp_decoder, but not from hevc or vp8?

It converts (drm nv12 10bit -> rga -> rgb888 ) but here is the result of Big_Buck_Bunny_1080_10s_30MB.mp4 (@nyanmisaka) to RGB_888

I have an FFmpeg that can convert it to YCbCr_420_P (I think it is I420, the one ffplay uses), but it uses dst hstride and vstride. That’s what I mean by padding, and RGB_888 can’t have padding.
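
To illustrate the padding point: if a converter can only write rows padded out to a stride, the padding has to be stripped before the buffer is usable as packed RGB888. The sketch below does that with a per-row copy on the CPU. It is not what the code in this thread does (the goal there is to have RGA write the packed buffer directly), and `rgb888_pack` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy an RGB888 image whose rows are src_stride bytes apart into a
 * tightly packed buffer where each row is exactly width * 3 bytes.
 * Any rows beyond height (vertical padding) are simply not copied. */
void rgb888_pack(uint8_t *dst, const uint8_t *src,
                 int width, int height, size_t src_stride)
{
    const size_t row_bytes = (size_t)width * 3u;

    assert(src_stride >= row_bytes);
    for (int y = 0; y < height; y++) {
        memcpy(dst + (size_t)y * row_bytes,
               src + (size_t)y * src_stride,
               row_bytes);
    }
}
```

The extra copy costs CPU time and memory bandwidth on every frame, which is presumably why getting an unpadded result straight out of RGA matters here.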

I guess @mtx512rk had exactly the same problem, which is why it is H264 only.

The problem may be on the kernel side (RGA).

Here is the interesting part:

[268867.672676] rga_debugger: yuv2rgb mode is 1
[268867.672679] rga_debugger: set core = 0, priority = 0, in_fence_fd = 0
[268867.672685] rga_policy: start policy on core = 1
[268867.672689] rga_policy: start policy on core = 2
[268867.672691] rga_policy: start policy on core = 4
[268867.672694] rga_policy: optional_cores = 7
[268867.672697] rga_policy: assign core: 1
[268867.673041] rga_dma_buf: iova_align size = 6221824
[268867.675029] rga3_reg: render_mode:bitblt, bitblit_mode=0, rotate_mode:0
[268867.675034] rga3_reg: win0: y = fe1d0000 uv = fe3ce000 v = fe44d800 src_w = 1920 src_h = 1080
[268867.675039] rga3_reg: win0: vw = 1920 vh = 1088 xoff = 0 yoff = 0 format = YCbCr420SP
[268867.675043] rga3_reg: win0: dst_w = 1920, dst_h = 1080, rd_mode = 0
[268867.675048] rga3_reg: win0: rot_mode = 1, en = 1, compact = 1, endian = 0
[268867.675051] rga3_reg: wr: y = fcef0010 uv = fd0ea410 v = fd168d10 vw = 1920 vh = 1080
[268867.675056] rga3_reg: wr: ovlp_xoff = 0 ovlp_yoff = 0 format = RGB888 rdmode = 0
[268867.675059] rga3_reg: mmu: win0 = 00 win1 = 00 wr = 00
[268867.675062] rga3_reg: alpha: flag 0 mode0=0 mode1=0
[268867.675066] rga3_reg: blend mode is no blend
[268867.675069] rga3_reg: yuv2rgb mode is 0
[268867.675164] rga_job: job: reqeust_id = 154184, priority = 0, core = 1
[268867.678651] rga_job: request[154184] finished 1 failed 0
[268867.678742] rga: Blit mode: request id = 154185
[268867.678747] rga_debugger: render_mode = 0, bitblit_mode=0, rotate_mode = 1
[268867.678752] rga_debugger: src: y = 2f uv = 0 v = 1fe000 aw = 1920 ah = 1080 vw = 1920 vh = 1088
[268867.678756] rga_debugger: src: xoff = 0, yoff = 0, format = 0xa, rd_mode = 1
[268867.678760] rga_debugger: dst: y=0 uv=7f8a476010 v=7f8a4da010 aw=640 ah=640 vw=640 vh=640
[268867.678763] rga_debugger: dst: xoff = 0, yoff = 0, format = 0x2, rd_mode = 1
[268867.678765] rga_debugger: mmu: mmu_flag=80000521 en=1
[268867.678768] rga_debugger: alpha: rop_mode = 0
[268867.678770] rga_debugger: yuv2rgb mode is 1
[268867.678773] rga_debugger: set core = 0, priority = 0, in_fence_fd = 0
[268867.678780] rga_policy: start policy on core = 1
[268867.678783] rga_policy: start policy on core = 2
[268867.678786] rga_policy: start policy on core = 4
[268867.678789] rga_policy: optional_cores = 7
[268867.678792] rga_policy: assign core: 1
[268867.678903] rga_dma_buf: iova_align size = 1232896
[268867.679035] rga3_reg: render_mode:bitblt, bitblit_mode=0, rotate_mode:0
[268867.679039] rga3_reg: win0: y = fe1d0000 uv = fe3ce000 v = fe44d800 src_w = 1920 src_h = 1080
[268867.679043] rga3_reg: win0: vw = 1920 vh = 1088 xoff = 0 yoff = 0 format = YCbCr420SP
[268867.679046] rga3_reg: win0: dst_w = 640, dst_h = 640, rd_mode = 0
[268867.679049] rga3_reg: win0: rot_mode = 1, en = 1, compact = 1, endian = 0
[268867.679053] rga3_reg: wr: y = fd3b0010 uv = fd414010 v = fd42d010 vw = 640 vh = 640
[268867.679056] rga3_reg: wr: ovlp_xoff = 0 ovlp_yoff = 0 format = RGB888 rdmode = 0
[268867.679059] rga3_reg: mmu: win0 = 00 win1 = 00 wr = 00
[268867.679062] rga3_reg: alpha: flag 0 mode0=0 mode1=0
[268867.679064] rga3_reg: blend mode is no blend
[268867.679066] rga3_reg: yuv2rgb mode is 0
[268867.679084] rga_job: job: reqeust_id = 154185, priority = 0, core = 1
[268867.680672] rga_job: request[154185] finished 1 failed 0
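
A few numbers in this log can be cross-checked with plain arithmetic. In the win0 lines, uv minus y is 0xfe3ce000 - 0xfe1d0000 = 0x1fe000, which is exactly 1920 x 1088, so the source is read as 8-bit NV12 with the height padded to 1088. The two iova_align sizes also match the RGB888 destination sizes rounded up to whole 4096-byte pages, if the buffer starts 0x10 bytes into a page as the wr addresses suggest. That second part is only my reading of the numbers, not something taken from the kernel source, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the interleaved CbCr plane in an 8-bit NV12 (YCbCr420SP)
 * buffer: it starts right after the Y plane of vw * vh bytes. */
size_t nv12_uv_offset(size_t vw, size_t vh)
{
    return vw * vh;
}

/* Total bytes: full-resolution Y plus the half-size CbCr plane. */
size_t nv12_size(size_t vw, size_t vh)
{
    return vw * vh + (vw * vh) / 2;
}

/* Assumption: size of the whole 4096-byte pages touched by a buffer of
 * len bytes that starts page_off bytes into its first page. */
size_t page_span(size_t page_off, size_t len)
{
    const size_t page = 4096;

    assert(page_off < page);
    return (page_off + len + page - 1) / page * page;
}
```

With these, `nv12_uv_offset(1920, 1088)` gives 0x1fe000, `page_span(0x10, 1920 * 1080 * 3)` gives 6221824 and `page_span(0x10, 640 * 640 * 3)` gives 1232896, the same values as in the log. So at least the buffer sizes and plane offsets look sane for the 8-bit H264 case.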