CIXBuilder problems compiling ONNX model - slow inference times

On the 24Q4 release it has cixbuild version 6.1.2958.
On the 25Q1 release it has cixbuild version 6.1.3119.

Using the YOLOv8 example from the Model Zoo, we downloaded the ONNX model from ModelScope.

Using cixbuild version 6.1.2958, we compiled the ONNX model to .cix format using the vendor-provided cfg file yolov8_lbuild.cfg.

We modified the vendor inference_npu.py script to do simple benchmarking and output the object detection results. Inference on the input image results in an average inference time of 110ms.
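
For illustration, a minimal sketch of this kind of benchmarking wrapper (the EngineInfer constructor and forward() signatures are assumed here rather than copied from the vendor script):

# Hypothetical benchmarking wrapper around the vendor EngineInfer class
# from NOE_Engine.py; adjust the constructor/forward() calls to match it.
import time
import numpy as np
from NOE_Engine import EngineInfer

def benchmark(model_path, input_data, warmup=3, runs=20):
    engine = EngineInfer(model_path)           # assumed: takes the .cix model path
    for _ in range(warmup):                    # discard warm-up iterations
        engine.forward(input_data)
    times = []
    outputs = None
    for _ in range(runs):
        t0 = time.perf_counter()
        outputs = engine.forward(input_data)   # end-to-end forward() call
        times.append((time.perf_counter() - t0) * 1000.0)
    print(f"average inference time: {np.mean(times):.2f} ms "
          f"(min {np.min(times):.2f}, max {np.max(times):.2f})")
    return outputs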

Then, using cixbuild version 6.1.3119, we repeat the process and compile the same ONNX model.
However, the average inference time is three times slower, at 340ms.

These tests are done on Radxa’s Debian b3 image.

The pre-compiled .cix model from ModelScope also runs with an average inference time of 110ms.

So what is the reason for cixbuild version 6.1.3119 producing a result that is three times slower?

Secondly, in the yolov8_lbuild.cfg file the optimizer has these settings:

trigger_float_op = disable & <[(258, 272)]:float16_preferred!>
weight_bits = 8& <[(273,274)]:16>
activation_bits = 8& <[(273,274)]:16>
bias_bits = 32& <[(273,274)]:48>

How are those magic numbers (258,272) and (273,274) determined?

I managed to work out that the magic numbers refer to layer IDs in the ONNX model.

That enabled me to pick the equivalent layer ranges for a yolov8s model (see the sketch after the config snippet below):

trigger_float_op = disable & <[(168,180)]:float16_preferred!>
weight_bits = 8& <[(181,182)]:16>
activation_bits = 8& <[(181,182)]:16>
bias_bits = 32& <[(181,182)]:48>
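
A minimal sketch of how the layer indices can be inspected, assuming the onnx Python package and that the optimizer's layer_id follows the graph node order (cross-check the indices against the IDs printed in the cixbuild log):

# List ONNX graph nodes with their indices so the layer-id ranges used in
# the cfg (e.g. (168,180) and (181,182) for yolov8s) can be matched up.
import onnx

model = onnx.load("yolov8s.onnx")          # example path
for idx, node in enumerate(model.graph.node):
    if idx >= 160:                         # only the tail, where the detection head sits
        print(f"{idx:4d}  {node.op_type:12s}  {node.name}")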

However, even with these settings we get the same poor performance, with a 320ms inference time, when using cixbuild version 6.1.3119.

When using cixbuild version 6.1.2958, compilation fails with:

[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[E] [OPT] [20:54:06]: 'node info: name=/model.22/Sigmoid, type=OpType.Activation, layer_id = 171': error message: RuntimeError('"sigmoid_cpu" not implemented for \'Half\'')
[I] [OPT] [20:54:06]: Compass-Optimizer has done at [quantize] period.
[I] [OPT] [20:54:06]: [Done]cost time: 1770.5304999351501
[E] Optimizing model failed! "sigmoid_cpu" not implemented for 'Half' ...

We can bypass that error by changing the config parameter to split the range around layer 171 (the failing Sigmoid):

trigger_float_op = disable & <[(168,170),(172,180)]:float16_preferred!>

However, when running inference we get 0 objects detected, and the inference time is 73ms, which is considerably slower than the RK3588's 48ms for the same model size.

For comparison, the yolov8l-sized model's inference time on the Orion is 110ms versus the RK3588's 133ms. That is roughly a 20% performance boost, but it is largely a direct result of the Orion's NPU clock being set at 1.2GHz whereas the RK3588's is 1GHz.

I encountered similar problems with the 25Q1 release.

You might want to check inside the forward method of the EngineInfer class in the NOE_Engine.py file, as it consists of three parts: setting the model's inputs and outputs, and the actual inference process. Based on my tests, the self-converted INT8+FP16 mixed-precision model takes significantly more time to set inputs and outputs compared to the official INT8 CIX model, while their pure inference latency is similar.

If you are only measuring the time of the forward method in inference_npu.py, this reflects the “inference latency including input and output”, meaning it includes the time for data copying. In contrast, “pure inference latency” refers solely to the time taken by the model’s inference process.
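
To illustrate the distinction, a minimal sketch (this is not the literal NOE_Engine.py code; the noe_job_infer_sync() argument list is assumed):

import time

def timed_forward(engine, input_data):
    # Time the end-to-end forward() call; the pure NPU latency has to be
    # measured inside forward() around the sync call itself.
    t0 = time.perf_counter()
    outputs = engine.forward(input_data)   # set inputs + infer + get outputs
    total_ms = (time.perf_counter() - t0) * 1000
    # Inside forward(), the "pure" latency is just the sync call, e.g.:
    #   t0 = time.perf_counter()
    #   noe.noe_job_infer_sync(job_id, timeout)   # argument list assumed
    #   pure_ms = (time.perf_counter() - t0) * 1000
    print(f"forward() including input/output copies: {total_ms:.2f} ms")
    return outputs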

I also noticed that the 25Q1 release is recommended to run on the 2503 BSP (i.e., the b6 image). Could this be related?

I wondered about that also, bought another NVMe drive and tried the b6 image too, but still saw the same slow inference times.

What inference models have you been trying?

I have also tried the driver packages cix-noe-umd and cix-npu-driver that ship with the 25Q1 release, but get no improvement.

Also, the coding.net model hub page for yolov8_l claims an inference time of 50.98ms; the best I can get is ~110ms, so something is off.

I found a blog post here of someone getting 29ms inference for the yolox_l model, which is what CIX claims, but I get 109ms. I am not doing anything different from their steps, so I wonder if there is something wrong with my Orion and/or its NPU.

Can you run this test on CIX’s YOLOv8 model to compare results?

I tested the yolov8l inference time at 43ms; the best result is about 38ms on the 25Q1 release. You can use the interface provided by NOE_Engine.py to measure inference time to avoid timing errors.

Thanks for your reply about how the inference time of ~43ms is calculated.

Unfortunately I disagree with what CIX has done here by only timing the noe_job_infer_sync() call in the NOE_Engine code. This is misleading, as it does not represent inference time as used in computer vision, so the FPS values calculated do not reflect real-world usage.

A true FPS value is timed from the image/data being loaded into memory, such as loading an image from disk or capturing a frame from a camera, through passing that data to the NPU for inference, and including post-processing of the model output and rendering of the detected objects on the image, ready for output to the user.
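
In code terms, something like the following, where preprocess, postprocess and draw are placeholders for the model-zoo helper functions:

import time
import cv2

def end_to_end_fps(engine, image_path):
    # Time from loading the frame to a rendered result, i.e. a "true" FPS
    t0 = time.perf_counter()
    frame = cv2.imread(image_path)          # or a frame grabbed from a camera
    blob = preprocess(frame)                # placeholder: resize/normalise for the model
    outputs = engine.forward(blob)          # NPU inference incl. data copies
    detections = postprocess(outputs)       # placeholder: decode boxes, apply NMS
    rendered = draw(frame, detections)      # placeholder: overlay results for the user
    elapsed = time.perf_counter() - t0
    print(f"end-to-end: {elapsed * 1000:.1f} ms ({1.0 / elapsed:.1f} FPS)")
    return rendered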

However, let me put that complaint to one side, as there is a bigger problem with the example Python code, which becomes obvious when comparing against the RK3588.

I have modified the NOE Engine forward() method to output fine-grained execution timing here. This results in the following timings:

Tensor retrieval time for tensor 0: 22.00 ms
Size of data retrieved for tensor 0: 705600 bytes
Data conversion time for tensor 0: 19.34 ms
Normalization time for tensor 0: 0.73 ms
Data preparation time: 17.59 ms
NPU inference time: 43.17 ms
Data retrieval time: 42.18 ms
Total time: 102.94 ms

Here we can see the NPU inference time of 43.17 ms, which is the benchmark time CIX claims. The total time of 102.94 ms is my timing, which is what a realistic inference time looks like.

Now this is where the problem lies: why is it taking 22 ms for noe_get_tensor() to retrieve a tensor of 705k bytes? Similarly, it takes 17.59 ms to load the input tensors with noe_load_tensor(). This is so slow that I can't understand how it has not been noticed.
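
For reference, a minimal sketch of this kind of per-phase instrumentation (the NOE call names are as used in NOE_Engine.py; their argument lists are elided because they belong to the vendor bindings):

import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print how long it took, and return its result
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.2f} ms")
    return result

# Inside forward(), each NOE call gets wrapped, e.g. (arguments elided):
#   timed("Data preparation time", noe.noe_load_tensor, ...)
#   timed("NPU inference time",    noe.noe_job_infer_sync, ...)
#   timed("Data retrieval time",   noe.noe_get_tensor, ...)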

Taking the same yolov8_l model from the ONNX source, I compiled it to RKNN format to run on the RK3588. I added the same timing sections to that code and got these results:

Set inputs time= 389.367µs
Run model time= 144.061837ms
Get Outputs time= 2.722946ms

The following table compares these execution timings:

Timing                       Orion O6     RK3588
Setting input tensors        17.59 ms     389.367 µs
Inference pass on NPU        43.17 ms     144 ms
Retrieving output tensors    42.18 ms     2.72 ms

The table makes it obvious that the Orion NPU is much faster than the RK3588 on the inference pass; however, there is some major slowness in the tensor input and output handling in the Orion code.

Is this just due to Python or the NOE library? Does CIX have a C/C++ example that performs differently?


I have been trying to get a C/C++ version running and have come across some other issues.

With reference to the NPU SDK User Guide v0.6 that shipped with the 25Q1 release:

  1. Page 104, it states there is a “sample” folder with C/C++ examples in the UMD driver source code package. Where do we get the UMD source code? Where are these samples?

The .deb package does not contain any source.

$ dpkg-deb -c  25q1-drivers/cix-noe-umd_1.0.0_arm64.deb
drwxr-xr-x root/root         0 2025-03-26 14:24 ./
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/pkgconfig/
-rwxr-xr-x root/root       265 2025-03-26 14:24 ./usr/lib/aarch64-linux-gnu/pkgconfig/cix-noe-umd.pc
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/python3/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/lib/python3/dist-packages/
-rwxr-xr-x root/root   2778784 2025-03-26 14:24 ./usr/lib/python3/dist-packages/libnoe.so
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/include/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/include/npu/
-rwxr-xr-x root/root     55190 2025-03-26 14:24 ./usr/share/cix/include/npu/cix_noe_standard_api.h
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/cix/lib/
-rwxr-xr-x root/root    904314 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.a
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so.0
-rwxr-xr-x root/root   1991264 2025-03-26 14:24 ./usr/share/cix/lib/libnoe.so.0.5.0
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/doc/
drwxr-xr-x root/root         0 2025-03-26 14:24 ./usr/share/doc/cix-noe-umd/
-rwxr-xr-x root/root       182 2025-03-26 14:24 ./usr/share/doc/cix-noe-umd/changelog.Debian.gz
  2. Page 105, there are inconsistencies in the code snippets, such as noe_ctx_handle_t, which does not exist in the header file /usr/share/cix/include/npu/cix_noe_standard_api.h. Instead you have to use context_handler_t.

  3. Page 103, it lists the API method noe_get_device_info; however, this method is not available in the compiled library, even though it does exist in the header:

/usr/bin/ld: /tmp/ccxrI9P9.ltrans0.ltrans.o: in function `main':
<artificial>:(.text.startup+0x418): undefined reference to `noe_get_device_info'
collect2: error: ld returned 1 exit status
$ grep noe_get_device_info /usr/share/cix/include/npu/cix_noe_standard_api.h
noe_status_t noe_get_device_info(const context_handler_t* ctx, device_info_t* device_info);

This means the files shipped in the 25Q1 package cix-noe-umd_1.0.0_arm64.deb are mismatched.

$ nm -D --defined-only /usr/share/cix/lib/libnoe.so | grep device_info
# returns nothing
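
The same check can be made from Python with ctypes, using the library path from the package listing above:

import ctypes

lib = ctypes.CDLL("/usr/share/cix/lib/libnoe.so")
try:
    lib.noe_get_device_info                 # attribute lookup resolves the symbol via dlsym
    print("noe_get_device_info is exported")
except AttributeError:
    print("declared in cix_noe_standard_api.h but not exported by libnoe.so")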

I have some C++ code running. It is not running full inference yet, but it does allow me to compare against the vendor Python code for loading input tensors and reading output tensors.

Tensor load time: 3.26896 ms
Inference sync time: 43.9627 ms
Ran job inference sync
Fetch outputs time: 4.37488 ms

Compared to the Python numbers in my previous post, this is much better:

Timing                       Orion O6 - Python    RK3588        Orion O6 - C++
Setting input tensors        17.59 ms             389.367 µs    3.26 ms
Inference pass on NPU        43.17 ms             144 ms        43.9 ms
Retrieving output tensors    42.18 ms             2.72 ms       4.37 ms

There could be further improvement if I can figure out the DMA buffer stuff…
