Thanks for your reply about how the inference time of ~43ms is calculated.
Unfortunately, I disagree with what CIX has done here by timing only the noe_job_infer_sync()
call in the NOE_Engine code. This is misleading because it does not represent the inference time that matters in computer vision, so the FPS values calculated from it do not reflect real-world usage.
A true FPS value is timed from the moment the image/data is loaded into memory, such as reading an image from disk or capturing a frame from a camera, through passing that data to the NPU for inference, post-processing the model outputs, and rendering the detected objects on the image ready for output to the user.
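To illustrate, here is a minimal sketch of what I mean by an end-to-end measurement; the run_inference, postprocess and draw_results callables are placeholders for whatever a real pipeline plugs in there:

```python
import time
import cv2

def end_to_end_fps(image_path, run_inference, postprocess, draw_results):
    """Time one full frame: load -> inference -> post-process -> render."""
    start = time.perf_counter()

    frame = cv2.imread(image_path)                 # load the image from disk (or grab a camera frame)
    outputs = run_inference(frame)                 # hand the data to the NPU and run the model
    detections = postprocess(outputs)              # decode boxes/classes/scores from the raw tensors
    annotated = draw_results(frame, detections)    # render detections on the image for the user

    elapsed = time.perf_counter() - start
    return annotated, 1.0 / elapsed                # FPS as an application actually experiences it
```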
However, let me put that complaint to one side, because there is a big problem with the example Python code which becomes obvious when comparing against the RK3588.
I have modified the NOE_Engine forward() method to output fine-grained execution timing here (a sketch of the instrumentation follows the timings below). This produces the following output:
```
Tensor retrieval time for tensor 0: 22.00 ms
Size of data retrieved for tensor 0: 705600 bytes
Data conversion time for tensor 0: 19.34 ms
Normalization time for tensor 0: 0.73 ms
Data preparation time: 17.59 ms
NPU inference time: 43.17 ms
Data retrieval time: 42.18 ms
Total time: 102.94 ms
```
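For context, the instrumentation is nothing more elaborate than time.perf_counter() wrapped around each NOE call inside forward(), roughly like this (simplified; the argument lists of the real noe_* functions are omitted here):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn once and print how long it took in milliseconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.2f} ms")
    return result

# Inside forward(), each stage is wrapped the same way (calls shown without
# their real arguments, which come from the NOE_Engine context/job handles):
#
#   timed("Data preparation time", noe_load_tensor, ...)   # load input tensor(s)
#   timed("NPU inference time", noe_job_infer_sync, ...)   # the ~43 ms CIX reports
#   timed("Data retrieval time", noe_get_tensor, ...)      # fetch output tensor(s)
```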
Here we can see the NPU inference time of 43.17 ms, which is the benchmark time CIX claims. The Total time of 102.94 ms is my measurement, and it is what a realistic inference time actually looks like.
Now this is where the problem lies: why is it taking 22 ms for noe_get_tensor() to retrieve a tensor that is only about 705 KB in size? Similarly, it takes 17.59 ms to load the input tensors with noe_load_tensor(). This is so slow that I can't understand how it has not been noticed.
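For a sense of scale, shuffling 705,600 bytes around in Python/NumPy is a sub-millisecond operation, so the 22 ms cannot simply be the cost of moving that much data. A quick baseline using the same payload size as tensor 0 above:

```python
import time
import numpy as np

# Same payload size as tensor 0 above: 705,600 bytes of uint8 data.
buf = np.random.randint(0, 255, size=705_600, dtype=np.uint8)

start = time.perf_counter()
for _ in range(100):
    _ = buf.copy()                                  # full copy of the ~705 KB buffer
avg_ms = (time.perf_counter() - start) * 1000 / 100

print(f"Average copy time: {avg_ms:.3f} ms")        # typically well under 1 ms per copy
```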
Taking the same yolov8_l model from the ONNX source, I compiled it to RKNN format to run on the RK3588 (a conversion sketch follows the results below). Adding the same timing sections to that code gives the following results:
```
Set inputs time= 389.367µs
Run model time= 144.061837ms
Get Outputs time= 2.722946ms
```
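For reference, the ONNX-to-RKNN conversion itself is only a few lines with rknn-toolkit2; the file names below are placeholders and the quantization settings depend on your calibration data:

```python
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform='rk3588')      # build for the RK3588 NPU
rknn.load_onnx(model='yolov8l.onnx')       # the same yolov8_l ONNX source model
rknn.build(do_quantization=False)          # or True, with a calibration dataset
rknn.export_rknn('yolov8l.rknn')           # model used for the RK3588 timings above
rknn.release()
```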
The following table compares these execution timings:

| Timing | Orion O6 | RK3588 |
| --- | --- | --- |
| Setting input tensors | 17.59 ms | 389.367 µs |
| Inference pass on NPU | 43.17 ms | 144 ms |
| Retrieving output tensors | 42.18 ms | 2.72 ms |
The table makes it obvious that the Orion NPU is much faster than the RK3588 on the inference pass itself; however, there is some major slowness in the tensor input and output handling in the Orion code.
Is this just due to Python or the NOE library? Does CIX have a C/C++ example that performs differently?