Yeah, we are lacking empirical benchmarks relating input size to the resulting mAP and FPS.
It seems some conversions benchmark ms (FPS) but forget to test mAP (mean Average Precision) or to mention the input size.
Use YOLOv8 on RK3588 NPU
I managed to get yolov8n (v1.6) running on all 3 NPU cores and compared it to yolov5s (v1.5), so this might interest you.
Note: my observations are from a programmer's perspective, not from an AI programmer's perspective.
YOLOv8 seems to be more accurate overall, but it misses some detections in the video sample (cars), and is thus a bit faster.
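For reference, here is a minimal sketch of one way to spread inference across the three cores with rknn-toolkit-lite2: one context pinned per core via core_mask, frames dispatched round-robin. This is an illustrative scheme under my assumptions, not necessarily the exact changes used in the demo below, and in a real pipeline each context would run on its own worker thread so the cores actually overlap.

# Minimal sketch: three RKNNLite contexts, one pinned to each RK3588 NPU core,
# with frames dispatched round-robin. Assumes rknn-toolkit-lite2 is installed.
from rknnlite.api import RKNNLite

MODEL = "./model/RK3588/yolov8n.rknn"
CORES = [RKNNLite.NPU_CORE_0, RKNNLite.NPU_CORE_1, RKNNLite.NPU_CORE_2]

contexts = []
for core in CORES:
    ctx = RKNNLite()
    assert ctx.load_rknn(MODEL) == 0, "failed to load model"
    # core_mask pins this context's inference to a single NPU core
    assert ctx.init_runtime(core_mask=core) == 0, "failed to init runtime"
    contexts.append(ctx)

def infer(frame_index, img):
    # img: preprocessed NHWC uint8 array of shape (1, 640, 640, 3)
    ctx = contexts[frame_index % len(contexts)]  # round-robin over cores
    return ctx.inference(inputs=[img])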
The test setup: X11 (dual-head), libmali, performance governor.
YOLOv5s:
DISPLAY=:0.0 ./rknn_yolov5_demo ./model/RK3588/yolov5s-640-640.rknn ../../../../../videos_rknn/h264.FVDO_Freeway_720p.264
Model name: ./model/RK3588/yolov5s-640-640.rknn
Threads: 12
Loading mode...
model is NHWC input fmt
rga_api version 1.9.1_[4]
loadLabelName ./model/coco_80_labels_list.txt
[the "Loading mode... / model is NHWC input fmt" pair repeats once per thread, 12 times in total]
60 frames avg: 90.909091 frames
60 frames avg: 109.689214 frames
60 frames avg: 118.577075 frames
avg: 105.616897 frames
YOLOv8n:
DISPLAY=:0.0 ./rknn_yolov8_demo ./model/RK3588/yolov8n.rknn ../../../../../videos_rknn/h264.FVDO_Freeway_720p.264
Model name: ./model/RK3588/yolov8n.rknn
Threads: 12
Loading mode...
model is NHWC input fmt
model input height=640, width=640, channel=3
rga_api version 1.9.1_[4]
loadLabelName ./model/coco_80_labels_list.txt
[the "Loading mode..." / input-format lines repeat once per thread, 12 times in total]
60 frames avg: 102.389078 frames
60 frames avg: 113.851992 frames
60 frames avg: 112.994350 frames
avg: 108.534776 frames
Regarding the 320 input resolution, I unfortunately lack the AI background to conduct a proper test.
320 doesn't matter much, as the models are trained at 640; you can squash the model down to 320 and it obviously runs faster, but less accurately.
You can use https://docs.ultralytics.com/modes/val/#introduction to get the mAP at 640 vs. 320 for the same model; when run below its 640 training resolution, it invariably loses mAP (mean Average Precision).
Likely the ratio will be about the same for the RKNN models.
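A minimal sketch of that comparison with the Ultralytics val API, assuming a stock yolov8n.pt checkpoint and a coco.yaml pointing at the COCO val set:

# Compare mAP of the same model validated at 640 vs 320 input resolution.
# Assumes the ultralytics package is installed and coco.yaml resolves the dataset.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

for imgsz in (640, 320):
    metrics = model.val(data="coco.yaml", imgsz=imgsz)
    # metrics.box.map is mAP50-95; metrics.box.map50 is mAP at IoU 0.50
    print(f"imgsz={imgsz}: mAP50-95={metrics.box.map:.3f}, mAP50={metrics.box.map50:.3f}")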
There is a dataset, https://cocodataset.org/#home, that also includes the images together with metadata describing what should be detected. So what you do is download it and run against it. I have totally forgotten how mAP is calculated; it is not just whether something was detected but how well the bounding box fit: https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52
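For the mechanics, pycocotools computes mAP from the ground-truth annotations plus your detections dumped in COCO results-JSON format; a minimal sketch, with both file names as placeholders:

# Minimal sketch: score detections against COCO ground truth with pycocotools.
# "instances_val2017.json" is the COCO annotation file; "detections.json" is
# your model's output in COCO results format (image_id, category_id, bbox, score).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()    # match detections to ground truth across IoU thresholds
evaluator.accumulate()  # build precision/recall curves
evaluator.summarize()   # prints AP/AR, including mAP@[.5:.95]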
I am not sure what "avg: 108.534776 frames" means. Is that 108 FPS?
I apologise, @avaf, as I meant to test https://github.com/airockchip/rknn_model_zoo but forgot to.
Yes, 108 fps…
How were you able to achieve 108 FPS at 640x640? Would you mind sharing which changes you made to get yolov8n working on 3 cores? I am using the yolov8 example from rknn_model_zoo, but because they took the DFL part out of the model, the post-processing is slow, and in total (inference + post-processing) it takes around 30-40 ms per frame (in Python).
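On the post-processing cost: since the DFL head runs on the CPU in that example, decoding it per anchor in a Python loop is slow. A vectorized NumPy sketch of the DFL step (softmax over the distribution bins, then the expected value), assuming the usual reg_max=16 layout; this is my own illustration, not the rknn_model_zoo code:

import numpy as np

def dfl_decode(box_logits: np.ndarray, reg_max: int = 16) -> np.ndarray:
    """Decode DFL box distributions to per-side distances, fully vectorized.

    box_logits: (num_anchors, 4 * reg_max) raw logits, one distribution of
    reg_max bins for each of the 4 box sides (l, t, r, b).
    Returns: (num_anchors, 4) expected distances in grid units.
    """
    x = box_logits.reshape(-1, 4, reg_max)
    x = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
    x /= x.sum(axis=-1, keepdims=True)
    bins = np.arange(reg_max, dtype=np.float32)
    return (x * bins).sum(axis=-1)                   # expectation over the bins

Filtering anchors by confidence score before the DFL decode, rather than after, also cuts most of the work, since only a small fraction of anchors survive the threshold.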