Optimizing Object Detection on Radxa Zero 3 (RK3566) - Performance Issues

Hello Radxa Community,

I am working on real-time object detection on the Radxa Zero 3 (RK3566) using an RKNN model. The model I am using is YOLO11n-rk3566.rknn, and I am running it with the RKNNLite API. The goal is low-latency inference on a live video feed. However, I am facing significant performance issues and need some guidance on optimizing my implementation.

Issue:

  • I am experiencing a latency of ~700-800 ms per frame, which results in only 2-3 FPS.
  • Given the RK3566’s NPU capabilities, I was expecting much better performance.

Setup Details:

  • Device: Radxa Zero 3 (RK3566)
  • Model: YOLO11n-rk3566.rknn
  • API Used: RKNNLite
  • Camera: USB Webcam (MJPG format, 640x640, 60 FPS)
  • Code Overview:
    • Using a dedicated thread to capture frames
    • Processing frames using the RKNN model
    • Drawing detections and displaying results in OpenCV

My Observations:

  1. NPU Utilization:
  • I am using RKNNLite for inference, but it seems the NPU is not delivering optimal performance.
  2. Frame Capture Latency:
  • Even though I set cv2.CAP_PROP_BUFFERSIZE=1 and capture at 60 FPS, I still get only 2-3 FPS output.
  3. Inference Latency:
  • The model inference seems to be the bottleneck, taking most of the processing time.
  4. Post-processing Overhead:
  • The bounding box conversion, filtering, and drawing detections might be adding extra delay.
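Before optimizing any single stage, it can help to measure where the time actually goes. Below is a minimal, self-contained timing sketch; the stage functions are placeholders standing in for capture, preprocessing, and inference, not code from the post:

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates wall-clock time per pipeline stage."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def timed(self, name, fn, *args, **kwargs):
        # Run one stage and record how long it took
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - t0
        self.counts[name] += 1
        return result

    def report(self):
        # Average milliseconds per stage
        return {name: 1000.0 * self.totals[name] / self.counts[name]
                for name in self.totals}

# Dummy stages for illustration; swap in cap.read(), preprocess_image(),
# and rknn_lite.inference() in the real loop
timer = StageTimer()
for _ in range(5):
    frame = timer.timed("capture", lambda: [0] * 640)
    blob = timer.timed("preprocess", lambda f: f, frame)
    _ = timer.timed("inference", lambda b: b, blob)

avg_ms = timer.report()
```

Printing the report once a second makes it immediately obvious whether capture, preprocessing, inference, or post-processing dominates the 700-800 ms.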

Code Implementation:

Here is the core part of my implementation:

from rknnlite.api import RKNNLite
import cv2
import numpy as np
import threading
from queue import Queue, Empty
import time

class OptimizedDetector:
    def __init__(self, model_path='./yolo11n_3566_rknn_model/yolo11n-rk3566.rknn'):
        self.rknn_lite = RKNNLite()
        # Both calls return 0 on success; fail loudly otherwise
        if self.rknn_lite.load_rknn(model_path) != 0:
            raise RuntimeError('load_rknn failed')
        if self.rknn_lite.init_runtime() != 0:
            raise RuntimeError('init_runtime failed')

        self.cap = cv2.VideoCapture(0)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 640)
        self.cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'MJPG'))
        self.cap.set(cv2.CAP_PROP_FPS, 60)

        self.frame_queue = Queue(maxsize=1)
        self.running = True

    def capture_frames(self):
        # Keep only the newest frame so inference never runs on stale data
        while self.running:
            ret, frame = self.cap.read()
            if not ret:
                continue
            if self.frame_queue.full():
                try:
                    self.frame_queue.get_nowait()
                except Empty:
                    pass
            self.frame_queue.put(frame)

    def preprocess_image(self, image):
        img = cv2.resize(image, (640, 640))
        # OpenCV delivers BGR; most RKNN YOLO exports expect RGB input
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return np.expand_dims(img, axis=0).astype(np.uint8)

    def run(self):
        capture_thread = threading.Thread(target=self.capture_frames, daemon=True)
        capture_thread.start()

        while True:
            try:
                # Block with a timeout instead of busy-waiting on empty()
                frame = self.frame_queue.get(timeout=1.0)
            except Empty:
                continue

            input_data = self.preprocess_image(frame)
            outputs = self.rknn_lite.inference(inputs=[input_data])
            # TODO: decode `outputs` (boxes/scores/classes) and draw them

            cv2.imshow('YOLO Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        self.cleanup()

    def cleanup(self):
        self.running = False
        self.cap.release()
        cv2.destroyAllWindows()
        self.rknn_lite.release()

if __name__ == "__main__":
    detector = OptimizedDetector()
    detector.run()

Questions:

  1. Is there a way to speed up inference on RK3566?
  2. Am I missing any optimizations for RKNNLite?
  3. Are there any recommended configurations for NPU acceleration on Radxa Zero 3?
  4. Would using a different capture format (e.g., uncompressed frames instead of MJPG, or converting to RGB earlier in the pipeline) help reduce latency?

Any suggestions, corrections, or optimizations would be greatly appreciated! If anyone has achieved better speeds on RK3566 with RKNN models, please share your experience.

Thanks in advance!

Best, Sunil

@Sunil_Ghanchi Have you made any progress with this? I am experiencing the same issue on the Rock3C (RK3566). Inference was around 400-500 ms per frame. Can I ask how much RAM your board has? I saw somewhere that the NPU may require 2 GB to run at full speed, but I can't remember where I saw that or how valid it actually is. My board only has 1 GB, hence the question.

I did get a small improvement by increasing the NPU frequency, but I am still only seeing 200-300 ms per frame. The advertised performance in the RKNN Model Zoo shows it should be around 50 ms per frame.

After hours wasted, I believe I figured out the problem (well, at least my problem). The pre-converted RKNN model provided by Radxa for the RK3566 must be outdated. After downloading the yolo11n.onnx model and converting it to .rknn for the RK3566 with the latest version of RKNN Toolkit2, I am seeing inference around 75 ms per frame. Not quite the 20 FPS advertised, but much better than 3 FPS.
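For anyone else doing the reconversion: it can be sketched with the rknn-toolkit2 Python API (this runs on an x86 Linux host, not on the board). The mean/std values and the calibration dataset path below are assumptions typical of YOLO exports, not taken from this thread; check your model's own preprocessing:

```python
def convert_yolo11n_to_rknn(onnx_path="yolo11n.onnx",
                            out_path="yolo11n-rk3566.rknn",
                            dataset="dataset.txt"):
    """Sketch of an ONNX -> RKNN conversion targeting the RK3566.

    Requires rknn-toolkit2 on an x86 Linux host. The normalization
    values (0-255 -> 0-1) are an assumption; verify against the
    model's own preprocessing before relying on the output.
    """
    # Import inside the function so the sketch loads without the toolkit
    from rknn.api import RKNN

    rknn = RKNN(verbose=True)
    rknn.config(mean_values=[[0, 0, 0]],
                std_values=[[255, 255, 255]],
                target_platform="rk3566")
    if rknn.load_onnx(model=onnx_path) != 0:
        raise RuntimeError("load_onnx failed")
    # do_quantization=True needs a calibration image list (dataset.txt)
    if rknn.build(do_quantization=True, dataset=dataset) != 0:
        raise RuntimeError("build failed")
    if rknn.export_rknn(out_path) != 0:
        raise RuntimeError("export_rknn failed")
    rknn.release()
```

Quantization quality depends heavily on the calibration images, so use frames representative of your actual camera feed.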

I also had to update all the pieces of the NPU stack (the NPU driver, the runtime, and rknn-toolkit-lite2), as the versions on the Rock3C pre-built image are all old. And I increased the NPU frequency to the max. I am not sure how much difference these made in the end.
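For the frequency change, the usual route is the devfreq interface in sysfs. A hedged sketch follows; the fde40000.npu node name is what I have seen reported for RK3566 boards but is an assumption here, so verify it with ls /sys/class/devfreq/ on your image (root required):

```python
import os

def set_npu_performance(base="/sys/class/devfreq/fde40000.npu"):
    """Pin the NPU devfreq governor to 'performance' (max frequency).

    The devfreq node name is an assumption for RK3566; check
    /sys/class/devfreq/ on your board. Writing requires root.
    Returns the current frequency so the change can be confirmed.
    """
    with open(os.path.join(base, "governor"), "w") as f:
        f.write("performance")
    with open(os.path.join(base, "cur_freq")) as f:
        return f.read().strip()
```

Note this is not persistent across reboots; re-apply it at boot (e.g., via a systemd unit) if the gain turns out to matter.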