CIXBuilder problems compiling ONNX model - slow inference times

On the 24Q4 release it has cixbuild version 6.1.2958.
On the 25Q1 release it has cixbuild version 6.1.3119.

Using the YOLOv8 example from the Model Zoo, we downloaded the ONNX model from ModelScope.

Using cixbuild version 6.1.2958, we compiled the ONNX model to .cix format using the vendor-provided cfg file yolov8_lbuild.cfg.

We modified the vendor inference_npu.py script to do simple benchmarking and output the object detection results. Inference on the input image results in an average inference time of 110ms.
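
For illustration, a minimal sketch of this kind of benchmarking wrapper (the EngineInfer constructor and forward() signatures are assumed here rather than copied from the vendor script):

# Hypothetical benchmarking wrapper around the vendor EngineInfer class
# from NOE_Engine.py; adjust the constructor/forward() calls to match it.
import time
import numpy as np
from NOE_Engine import EngineInfer

def benchmark(model_path, input_data, warmup=3, runs=20):
    engine = EngineInfer(model_path)           # assumed: takes the .cix model path
    for _ in range(warmup):                    # discard warm-up iterations
        engine.forward(input_data)
    times = []
    outputs = None
    for _ in range(runs):
        t0 = time.perf_counter()
        outputs = engine.forward(input_data)   # end-to-end forward() call
        times.append((time.perf_counter() - t0) * 1000.0)
    print(f"average inference time: {np.mean(times):.2f} ms "
          f"(min {np.min(times):.2f}, max {np.max(times):.2f})")
    return outputs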

Then, using cixbuild version 6.1.3119, we repeat the process and compile the same ONNX model.
However, the average inference time is three times slower, at 340ms.

These tests are done on Radxa’s Debian b3 image.

The pre-compiled .cix model from ModelScope also runs with an average inference time of 110ms.

So what is the reason for cixbuild version 6.1.3119 producing a result that is three times slower?

Secondly, in the yolov8_lbuild.cfg file the optimizer has these settings:

trigger_float_op = disable & <[(258, 272)]:float16_preferred!>
weight_bits = 8& <[(273,274)]:16>
activation_bits = 8& <[(273,274)]:16>
bias_bits = 32& <[(273,274)]:48>

How are those magic numbers (258,272) and (273,274) determined?

I managed to work out that the magic numbers refer to layer IDs in the ONNX model.

That enabled me to pick the equivalent layer ranges for a yolov8s model (see the sketch after the config snippet below):

trigger_float_op = disable & <[(168,180)]:float16_preferred!>
weight_bits = 8& <[(181,182)]:16>
activation_bits = 8& <[(181,182)]:16>
bias_bits = 32& <[(181,182)]:48>
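
A minimal sketch of how the layer indices can be inspected, assuming the onnx Python package and that the optimizer's layer_id follows the graph node order (cross-check the indices against the IDs printed in the cixbuild log):

# List ONNX graph nodes with their indices so the layer-id ranges used in
# the cfg (e.g. (168,180) and (181,182) for yolov8s) can be matched up.
import onnx

model = onnx.load("yolov8s.onnx")          # example path
for idx, node in enumerate(model.graph.node):
    if idx >= 160:                         # only the tail, where the detection head sits
        print(f"{idx:4d}  {node.op_type:12s}  {node.name}")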

However, even with these settings we get the same poor performance, with a 320ms inference time, when using cixbuild version 6.1.3119.

When using cixbuild version 6.1.2958, compilation fails with:

[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[I] [OPT] [20:54:06]: Compass-Optimizer has done at [quantize] period.
[I] [OPT] [20:54:06]: [Done]cost time: 1770.5304999351501
[E] Optimizing model failed! "sigmoid_cpu" not implemented for 'Half' ...

We can bypass that error by changing the config parameter to split the range around layer 171 (the failing Sigmoid):

trigger_float_op = disable & <[(168,170),(172,180)]:float16_preferred!>

However, when running inference we get 0 objects detected, and the inference time is 73ms, which is considerably slower than the RK3588's 48ms for the same model size.

For comparison, the yolov8l-sized model's inference time on the Orion is 110ms versus the RK3588's 133ms. That is roughly a 20% performance boost, but it is largely a direct result of the Orion's NPU clock being set at 1.2GHz whereas the RK3588's is 1GHz.

I encountered similar problems with the 25Q1 release.

You might want to check inside the forward method of the EngineInfer class in the NOE_Engine.py file, as it consists of three parts: setting the model's inputs and outputs, and the actual inference process. Based on my tests, the self-converted INT8+FP16 mixed-precision model takes significantly more time to set inputs and outputs compared to the official INT8 CIX model, while their pure inference latency is similar.

If you are only measuring the time of the forward method in inference_npu.py, this reflects the “inference latency including input and output”, meaning it includes the time for data copying. In contrast, “pure inference latency” refers solely to the time taken by the model’s inference process.
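
To illustrate the distinction, a minimal sketch (this is not the literal NOE_Engine.py code; the noe_job_infer_sync() argument list is assumed):

import time

def timed_forward(engine, input_data):
    # Time the end-to-end forward() call; the pure NPU latency has to be
    # measured inside forward() around the sync call itself.
    t0 = time.perf_counter()
    outputs = engine.forward(input_data)   # set inputs + infer + get outputs
    total_ms = (time.perf_counter() - t0) * 1000
    # Inside forward(), the "pure" latency is just the sync call, e.g.:
    #   t0 = time.perf_counter()
    #   noe.noe_job_infer_sync(job_id, timeout)   # argument list assumed
    #   pure_ms = (time.perf_counter() - t0) * 1000
    print(f"forward() including input/output copies: {total_ms:.2f} ms")
    return outputs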

I also noticed that the 25Q1 release is recommended to run on the 2503 BSP (i.e., the b6 image). Could this be related?

I wondered about that also, bought another NVMe drive and tried the b6 image too, but still saw the same slow inference times.

What inference models have you been trying?

I have also tried the driver packages cix-noe-umd and cix-npu-driver that ship with the 25Q1 release, but get no improvement.

Also, the coding.net model hub page for yolov8_l claims an inference time of 50.98ms; the best I can get is ~110ms, so something is off.

I found a blog post here of someone getting 29ms inference for the yolox_l model, which is what CIX claims, but I get 109ms. I am not doing anything different from their steps, so I wonder if there is something wrong with my Orion and/or its NPU.

Can you run this test on CIX’s YOLOv8 model to compare results?

I tested the yolov8l inference time at 43ms; the best result is about 38ms on the 25Q1 release. You can use the interface provided by NOE_Engine.py to measure inference time to avoid timing errors.

Thanks for your reply about how the inference time of ~43ms is calculated.

Unfortunately I disagree with what CIX has done here by only timing the noe_job_infer_sync() call in the NOE_Engine code. This is misleading, as it does not represent inference time as used in computer vision, so the FPS values calculated do not reflect real-world usage.

A true FPS value is timed from the image/data being loaded into memory, such as loading an image from disk or capturing a frame from a camera, through passing that data to the NPU for inference, and including post-processing of the model output and rendering of the detected objects on the image, ready for output to the user.
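
In code terms, something like the following, where preprocess, postprocess and draw are placeholders for the model-zoo helper functions:

import time
import cv2

def end_to_end_fps(engine, image_path):
    # Time from loading the frame to a rendered result, i.e. a "true" FPS
    t0 = time.perf_counter()
    frame = cv2.imread(image_path)          # or a frame grabbed from a camera
    blob = preprocess(frame)                # placeholder: resize/normalise for the model
    outputs = engine.forward(blob)          # NPU inference incl. data copies
    detections = postprocess(outputs)       # placeholder: decode boxes, apply NMS
    rendered = draw(frame, detections)      # placeholder: overlay results for the user
    elapsed = time.perf_counter() - t0
    print(f"end-to-end: {elapsed * 1000:.1f} ms ({1.0 / elapsed:.1f} FPS)")
    return rendered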

However, let me put that complaint to one side, as there is a bigger problem with the example Python code, which becomes obvious when comparing against the RK3588.

I have modified the NOE Engine forward() method to output fine-grained execution timing here. This results in the following timings:

Tensor retrieval time for tensor 0: 22.00 ms
Size of data retrieved for tensor 0: 705600 bytes
Data conversion time for tensor 0: 19.34 ms
Normalization time for tensor 0: 0.73 ms
Data preparation time: 17.59 ms
NPU inference time: 43.17 ms
Data retrieval time: 42.18 ms
Total time: 102.94 ms

Here we can see the NPU inference time of 43.17 ms, which is the benchmark time CIX claims. The total time of 102.94 ms is my timing, which is what a realistic inference time looks like.

Now this is where the problem lies: why is it taking 22 ms for noe_get_tensor() to retrieve a tensor of 705k bytes? Similarly, it takes 17.59 ms to load the input tensors with noe_load_tensor(). This is so slow that I can't understand how it has not been noticed.
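
For reference, a minimal sketch of this kind of per-phase instrumentation (the NOE call names are as used in NOE_Engine.py; their argument lists are elided because they belong to the vendor bindings):

import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print how long it took, and return its result
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.2f} ms")
    return result

# Inside forward(), each NOE call gets wrapped, e.g. (arguments elided):
#   timed("Data preparation time", noe.noe_load_tensor, ...)
#   timed("NPU inference time",    noe.noe_job_infer_sync, ...)
#   timed("Data retrieval time",   noe.noe_get_tensor, ...)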

Taking the same yolov8_l model from the ONNX source, I compiled it to RKNN format to run on the RK3588. I added the same timing sections to that code and got these results:

Set inputs time= 389.367µs
Run model time= 144.061837ms
Get Outputs time= 2.722946ms

The following table compares these execution timings:

Timing                       Orion O6     RK3588
Setting input tensors        17.59 ms     389.367 µs
Inference pass on NPU        43.17 ms     144 ms
Retrieving output tensors    42.18 ms     2.72 ms

The table makes it obvious that the Orion NPU is much faster than the RK3588 on the inference pass; however, there is some major slowness in the tensor input and output handling in the Orion code.

Is this just due to Python or the NOE library? Does CIX have a C/C++ example that performs differently?


I have been trying to get a C/C++ version running and have come across some other issues.

With reference to the NPU SDK User Guide v0.6 that shipped with the 25Q1 release:

  1. Page 104, it states there is a “sample” folder with C/C++ examples in the UMD driver source code package. Where do we get the UMD source code? Where are these samples?

The .deb package does not contain any source.

$ dpkg-deb -c  25q1-drivers/cix-noe-umd_1.0.0_arm64.deb
drwxr-xr-x root/root         0 2025-03-26 14:24 ./
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/pkgconfig/
-rwxr-xr-x root/root       265 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/pkgconfig/cix-noe-umd.pc
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/python3/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/python3/dist-packages/
-rwxr-xr-x root/root   2778784 2025-03-26 14:24 ./usr/lib/python3/dist-packages/libnoe.so
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/include/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/include/npu/
-rwxr-xr-x root/root     55190 2025-03-26 14:24 ./usr/share/cix/include/npu/cix_noe_standard_api.h
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/lib/
-rwxr-xr-x root/root    904314 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.a
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so.0
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so.0.5.0
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/doc/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/doc/cix-noe-umd/
-rwxr-xr-x root/root       182 2025-03-26 14:24 ./usr/share/doc/cix-noe-umd/changelog.Debian.gz
  2. Page 105, there are inconsistencies in the code snippets, such as noe_ctx_handle_t, which does not exist in the header file /usr/share/cix/include/npu/cix_noe_standard_api.h. Instead you have to use context_handler_t.

  3. Page 103, it lists the API method noe_get_device_info; however, this method is not available in the compiled library, even though it does exist in the header:

/usr/bin/ld: /tmp/ccxrI9P9.ltrans0.ltrans.o: in function `main':
<artificial>:(.text.startup+0x418): undefined reference to `noe_get_device_info'
collect2: error: ld returned 1 exit status
$ grep noe_get_device_info /usr/share/cix/include/npu/cix_noe_standard_api.h
noe_status_t noe_get_device_info(const context_handler_t* ctx, device_info_t* device_info);

This means the files shipped in the 25Q1 package cix-noe-umd_1.0.0_arm64.deb are mismatched.

$ nm -D --defined-only /usr/share/cix/lib/libnoe.so | grep device_info
# returns nothing
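
The same check can be made from Python with ctypes, using the library path from the package listing above:

import ctypes

lib = ctypes.CDLL("/usr/share/cix/lib/libnoe.so")
try:
    lib.noe_get_device_info                 # attribute lookup resolves the symbol via dlsym
    print("noe_get_device_info is exported")
except AttributeError:
    print("declared in cix_noe_standard_api.h but not exported by libnoe.so")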

I have some C++ code running. It is not running full inference yet, but it does allow me to compare against the vendor Python code for loading input tensors and reading output tensors.

Tensor load time: 3.26896 ms
Inference sync time: 43.9627 ms
Ran job inference sync
Fetch outputs time: 4.37488 ms

Compared to the Python numbers in my previous post, this is much better:

Timing                       Orion O6 - Python    RK3588        Orion O6 - C++
Setting input tensors        17.59 ms             389.367 µs    3.26 ms
Inference pass on NPU        43.17 ms             144 ms        43.9 ms
Retrieving output tensors    42.18 ms             2.72 ms       4.37 ms

There could be further improvement if I can figure out the DMA buffer stuff…
