CIXBuilder problems compiling ONNX model - slow inference times

The 24Q4 release ships cixbuild version 6.1.2958.
The 25Q1 release ships cixbuild version 6.1.3119.

Using the YOLOv8 example from the Model Zoo, we downloaded the ONNX model from ModelScope.

Using cixbuild version 6.1.2958, we compiled the ONNX model to .cix format with the vendor-provided cfg file yolov8_lbuild.cfg.

We modified the vendor's inference_npu.py script to do simple benchmarking and output the object detection results. Inference on the input image gives an average inference time of 110ms.
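For reference, here is a minimal sketch of the timing loop we added. It assumes the EngineInfer wrapper and its forward method from the vendor's NOE_Engine.py; the import path, model path and input shape below are placeholders to adapt to your setup.

import time
import numpy as np
from utils.NOE_Engine import EngineInfer  # adjust the import path to match the vendor example

model = EngineInfer("yolov8_l.cix")  # placeholder path to the compiled model
# Placeholder input: must match the preprocessing that inference_npu.py applies to the real image.
input_data = [np.random.rand(1, 3, 640, 640).astype(np.float32)]

model.forward(input_data)  # warm-up run so first-call overhead is not counted

runs = 50
start = time.perf_counter()
for _ in range(runs):
    outputs = model.forward(input_data)
avg_ms = (time.perf_counter() - start) * 1000.0 / runs
print(f"average forward() time over {runs} runs: {avg_ms:.1f} ms")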

We then repeated the process with cixbuild version 6.1.3119, compiling the same ONNX model.
However, the average inference time is roughly three times slower, at 340ms.

These tests are done on Radxa’s Debian b3 image.

The pre-compiled .cix model from ModelScope also runs with an average inference time of 110ms.

So what is the reason for cixbuild version 6.1.3119 producing a result that is three times slower?

Secondly, in the yolov8_lbuild.cfg file the optimizer has these settings:

trigger_float_op = disable & <[(258, 272)]:float16_preferred!>
weight_bits = 8 & <[(273,274)]:16>
activation_bits = 8 & <[(273,274)]:16>
bias_bits = 32 & <[(273,274)]:48>

How are those magic numbers (258,272) and (273,274) determined?

I managed to work out that the magic numbers refer to layer IDs in the ONNX model.
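As a rough way to locate them, the graph can be dumped with node indices using the onnx package. This is a sketch only, and the filename is a placeholder; cixbuild numbers layers in its own IR, so treat the ONNX node order as a guide and cross-check against the layer names and ids printed in the optimizer log (e.g. layer_id = 171 for /model.22/Sigmoid further down).

import onnx

model = onnx.load("yolov8s.onnx")  # placeholder path to the ONNX model
for idx, node in enumerate(model.graph.node):
    # Print index, op type and name so the detection-head layers can be located.
    print(f"{idx:4d}  {node.op_type:<16s} {node.name}")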

That enabled me to pick the corresponding numbers for a yolov8s model:

trigger_float_op = disable & <[(168,180)]:float16_preferred!>
weight_bits = 8 & <[(181,182)]:16>
activation_bits = 8 & <[(181,182)]:16>
bias_bits = 32 & <[(181,182)]:48>

However, we get the same poor performance, with a 320ms inference time, when using cixbuild version 6.1.3119.

When using cixbuild version 6.1.2958, compilation fails with:

[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[I] [OPT] [20:54:06]: Compass-Optimizer has done at [quantize] period.
[I] [OPT] [20:54:06]: [Done]cost time: 1770.5304999351501
[E] Optimizing model failed! "sigmoid_cpu" not implemented for 'Half' ...

We can bypass that error by excluding layer 171 (the Sigmoid) from the float16 range:

trigger_float_op = disable & <[(168,170),(172,180)]:float16_preferred!>

However, when running inference we get 0 objects detected, and the inference time of 73ms is still considerably slower than the RK3588's 48ms for the same model size.

For comparison, the yolov8l-sized model's inference time on the Orion is 110ms versus the RK3588's 133ms. That is about a 20% performance advantage (133ms / 110ms ≈ 1.21), but it is a direct result of the Orion's NPU clock being set at 1.2GHz whereas the RK3588's runs at 1GHz, the same 1.2× ratio.

I encountered similar problems with the 25Q1 release.

You might want to check inside the forward method of the EngineInfer class in NOE_Engine.py, as it consists of three parts: setting the model's input, the actual inference, and retrieving the output. Based on my tests, the self-converted INT8+FP16 mixed-precision model takes significantly more time to set inputs and outputs compared to the official INT8 CIX model, while their pure inference latency is similar.

If you are only measuring the time of the forward method in inference_npu.py, this reflects the “inference latency including input and output”, meaning it includes the time for data copying. In contrast, “pure inference latency” refers solely to the time taken by the model’s inference process.
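One way to confirm where the time goes is to add timestamps around the three phases inside forward(). The phase method names below are hypothetical placeholders; substitute whatever the input-setting, inference and output-reading steps are actually called in your copy of NOE_Engine.py.

import time

def timed_forward(self, inputs):
    # Sketch of instrumenting EngineInfer.forward(); the phase method names are hypothetical.
    t0 = time.perf_counter()
    self._set_inputs(inputs)       # hypothetical: copy/set input buffers
    t1 = time.perf_counter()
    self._run_inference()          # hypothetical: the pure NPU inference call
    t2 = time.perf_counter()
    outputs = self._get_outputs()  # hypothetical: copy/read output buffers
    t3 = time.perf_counter()
    print(f"set inputs : {(t1 - t0) * 1000:.2f} ms")
    print(f"inference  : {(t2 - t1) * 1000:.2f} ms")
    print(f"get outputs: {(t3 - t2) * 1000:.2f} ms")
    return outputs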

I also noticed that the 25Q1 release is recommended to run on the 2503 BSP (i.e., the b6 image). Could this be related?

I wondered about that too, so I bought another NVMe drive and tried the b6 image as well, but still saw the same slow inference times.

What inference models have you been trying?

I have also tried the driver packages cix-noe-umd and cix-npu-driver that ship with the 25Q1 release, but get no improvement.

Also, the coding.net model hub page for yolov8_l claims an inference time of 50.98ms; the best I can get is ~110ms, so something is off.

I found a blog post here of someone getting 29ms inference for the yolox_l model, which is what CIX claims, but I get 109ms. I am not doing anything differently from their steps, so I wonder whether there is something wrong with my Orion and/or its NPU.

Can you run this test on CIX’s YOLOv8 model to compare results?