Hello Radxa Community,
I am working on real-time object detection on the Radxa Zero 3 (RK3566) using an RKNN model. The model is YOLO11n-rk3566.rknn, and I am running it with the RKNNLite API. The goal is low-latency inference on a live video feed, but I am facing significant performance issues and need some guidance on optimizing my implementation.
Issue:
- I am experiencing a latency of ~700-800 ms per frame, which results in only 2-3 FPS.
- Given the RK3566’s NPU capabilities, I was expecting much better performance.
Setup Details:
- Device: Radxa Zero 3 (RK3566)
- Model: YOLO11n-rk3566.rknn
- API Used: RKNNLite
- Camera: USB Webcam (MJPG format, 640x640, 60 FPS)
Code Overview:
- Using a dedicated thread to capture frames
- Processing frames using the RKNN model
- Drawing detections and displaying results in OpenCV
My Observations:
- NPU Utilization: I am using rknnlite for inference, but the NPU does not seem to be delivering optimal performance.
- Frame Capture Latency: Even though I set cv2.CAP_PROP_BUFFERSIZE=1 and capture at 60 FPS, I still get only 2-3 FPS output.
- Inference Latency: The model inference seems to be the bottleneck, taking most of the processing time.
- Post-processing Overhead: The bounding-box conversion, filtering, and drawing of detections might be adding extra delay. (A per-stage timing sketch follows this list.)
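To confirm which stage actually dominates, here is a minimal timing sketch I put together. It reuses the same model path and camera index as the code below; the 50-frame loop count is arbitrary. Each stage is bracketed with time.perf_counter() so capture, preprocessing, and inference can be compared directly:

import time

import cv2
import numpy as np
from rknnlite.api import RKNNLite

# Per-stage timing sketch (same model path and camera as the main code).
rknn = RKNNLite()
rknn.load_rknn('./yolo11n_3566_rknn_model/yolo11n-rk3566.rknn')
rknn.init_runtime()

cap = cv2.VideoCapture(0)
for _ in range(50):
    t0 = time.perf_counter()
    ret, frame = cap.read()
    t1 = time.perf_counter()
    if not ret:
        break
    # Same preprocessing as in the detector class below.
    img = cv2.resize(frame, (640, 640))
    img = np.expand_dims(img, axis=0).astype(np.uint8)
    t2 = time.perf_counter()
    outputs = rknn.inference(inputs=[img])
    t3 = time.perf_counter()
    print(f"capture {1000 * (t1 - t0):6.1f} ms | "
          f"preprocess {1000 * (t2 - t1):6.1f} ms | "
          f"inference {1000 * (t3 - t2):6.1f} ms")

cap.release()
rknn.release()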
Code Implementation:
Here is the core part of my implementation:
from rknnlite.api import RKNNLite
import cv2
import numpy as np
import threading
from queue import Queue, Empty
import time


class OptimizedDetector:
    def __init__(self, model_path='./yolo11n_3566_rknn_model/yolo11n-rk3566.rknn'):
        self.rknn_lite = RKNNLite()
        self.rknn_lite.load_rknn(model_path)
        self.rknn_lite.init_runtime()
        self.cap = cv2.VideoCapture(0)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 640)
        self.cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'MJPG'))
        self.cap.set(cv2.CAP_PROP_FPS, 60)
        # Queue of size 1 so the consumer always sees the latest frame.
        self.frame_queue = Queue(maxsize=1)
        self.running = True

    def capture_frames(self):
        # Runs in a separate thread; drops the stale frame when the queue is full.
        while self.running:
            ret, frame = self.cap.read()
            if ret:
                if self.frame_queue.full():
                    try:
                        self.frame_queue.get_nowait()
                    except Empty:
                        pass
                self.frame_queue.put(frame)

    def preprocess_image(self, image):
        # Note: OpenCV delivers BGR frames; the RKNN model may expect RGB.
        img = cv2.resize(image, (640, 640))
        img = np.expand_dims(img, axis=0).astype(np.uint8)
        return img

    def run(self):
        capture_thread = threading.Thread(target=self.capture_frames, daemon=True)
        capture_thread.start()
        while True:
            try:
                # Blocking get instead of busy-waiting on queue.empty().
                frame = self.frame_queue.get(timeout=1.0)
            except Empty:
                continue
            input_data = self.preprocess_image(frame)
            outputs = self.rknn_lite.inference(inputs=[input_data])
            # (Post-processing and box drawing omitted here for brevity.)
            cv2.imshow('YOLO Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        self.cleanup()

    def cleanup(self):
        self.running = False
        self.cap.release()
        cv2.destroyAllWindows()
        self.rknn_lite.release()


if __name__ == "__main__":
    detector = OptimizedDetector()
    detector.run()
Questions:
- Is there a way to speed up inference on RK3566?
- Am I missing any optimizations for RKNNLite?
- Are there any recommended configurations for NPU acceleration on Radxa Zero 3?
- Would using a different input format (e.g., raw RGB frames instead of MJPG) help in reducing latency? (A quick format check is sketched below.)
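On that last question, here is a small sketch (assuming the same cv2.VideoCapture(0) device as above) to verify which codec and frame rate the webcam actually negotiated, since values passed to cap.set() are requests and are not guaranteed to take effect:

import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'MJPG'))
cap.set(cv2.CAP_PROP_FPS, 60)

# Decode the negotiated FOURCC back into a readable four-character string.
fourcc = int(cap.get(cv2.CAP_PROP_FOURCC))
codec = "".join(chr((fourcc >> (8 * i)) & 0xFF) for i in range(4))
print(f"negotiated codec: {codec}, fps: {cap.get(cv2.CAP_PROP_FPS)}")
cap.release()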
Any suggestions, corrections, or optimizations would be greatly appreciated! If anyone has achieved better speeds on RK3566 with RKNN models, please share your experience.
Thanks in advance!
Best, Sunil