I played a bit more with ff-rknn on X11 and added some parameters to detect a single object.
DISPLAY=:0.0 ./ff-rknn -i ~/weston/apps/videos_rknn/vid-13.mp4 -x 960 -y 540 -l 960 -t 0 -m ./model/RK3588/yolov5s-640-640.rknn -b 40 -o bird -a 60
Where -a = confidence threshold, -b = alpha-blend mask, and -o = the object to detect (bird in this example).
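To illustrate what the -o, -a and -b options in the command above amount to, here is a rough Python sketch of the idea. It is not the ff-rknn code (which is not shown in this post). The names Detection, filter_detections and blend_pixel are made up for this example, and it assumes that -a and -b are percentages (0 to 100), which the post does not state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical detection record, not the real ff-rknn structure."""
    label: str                       # class name, e.g. "bird"
    confidence: float                # score in the range 0.0 to 1.0
    box: Tuple[int, int, int, int]   # left, top, right, bottom in pixels


def filter_detections(detections: List[Detection],
                      wanted_label: str,
                      min_confidence_pct: float) -> List[Detection]:
    """Keep only detections of one class at or above the threshold.

    Assumption: the threshold is given in percent, like "-a 60".
    """
    threshold = min_confidence_pct / 100.0
    return [d for d in detections
            if d.label == wanted_label and d.confidence >= threshold]


def blend_pixel(pixel: Tuple[int, int, int],
                mask_color: Tuple[int, int, int],
                alpha_pct: float) -> Tuple[int, int, int]:
    """Alpha-blend a mask colour over one RGB pixel.

    Assumption: alpha is given in percent, like "-b 40".
    """
    a = alpha_pct / 100.0
    return tuple(round((1.0 - a) * p + a * m)
                 for p, m in zip(pixel, mask_color))


if __name__ == "__main__":
    detections = [
        Detection("bird", 0.72, (100, 80, 220, 190)),
        Detection("bird", 0.41, (400, 300, 460, 350)),
        Detection("person", 0.90, (10, 10, 200, 500)),
    ]
    # Equivalent in spirit to "-o bird -a 60"
    for d in filter_detections(detections, "bird", 60):
        print(d.label, d.confidence, d.box)
    # Equivalent in spirit to "-b 40" with a green mask
    print(blend_pixel((100, 100, 100), (0, 255, 0), 40))
```

Filtering before drawing means only the selected class costs any blending work, which matters when several streams are displayed at once.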
Some benchmarks showed that it is still reasonable to display 12 streams and record a screencast at the same time. I tried to use the NPU SRAM to reduce DDR bandwidth, as described here: (rknpu2/doc/RK3588_NPU_SRAM_usage.md at master · rockchip-linux/rknpu2 · GitHub) and in ROCK 5B Debug Party Invitation - #689 by avaf, but the results were not good.
screencast (3840x1080):

