C++ Example running YOLOv8 on the NPU

I have completed a C++ example running a YOLOv8 model on the NPU. It shows the full process of achieving inference, from the source PyTorch model through to the final object detection results.

In comparison to CIX’s Python code, we get a significant performance improvement:

| Timing | Python | C++ |
|---|---|---|
| Setting input tensors | 17.22 ms | 3.07 ms |
| Inference pass on NPU | 55.22 ms | 55.54 ms |
| Retrieving output tensors | 42.57 ms | 6.72 ms |
| Total time | 115.01 ms | 65.33 ms |
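For anyone wanting to reproduce these numbers, per-stage timings like the ones above can be gathered with a small `std::chrono` wrapper. This is just a minimal sketch of the measurement approach; the actual NOE calls each stage would wrap are not shown here:

```cpp
#include <chrono>
#include <functional>

// Measure the wall-clock time of a single pipeline stage in milliseconds.
// Pass each stage (set inputs, run inference, fetch outputs) as a lambda.
double time_stage_ms(const std::function<void()>& stage) {
    auto start = std::chrono::steady_clock::now();
    stage();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Summing the per-stage results rather than timing the whole loop once makes it obvious where the Python version loses its time (tensor copies, not the NPU pass itself).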

I have also outlined how to work out the magic numbers used for quantization by the CIX compiler, so you can find them for other model sizes.
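For context, those magic numbers are typically the scale and zero-point of the usual affine quantization scheme, where `real = scale * (q - zero_point)`. A minimal sketch of dequantizing an int8 output tensor with them (whether CIX uses per-tensor or per-channel scales is an assumption here; check the write-up for the actual values):

```cpp
#include <cstdint>
#include <vector>

// Affine dequantization: real = scale * (q - zero_point).
// scale and zero_point are the per-tensor "magic numbers" the
// compiler bakes into the quantized model.
std::vector<float> dequantize(const std::vector<int8_t>& q,
                              float scale, int32_t zero_point) {
    std::vector<float> out;
    out.reserve(q.size());
    for (int8_t v : q)
        out.push_back(scale * (static_cast<int32_t>(v) - zero_point));
    return out;
}
```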


It’s nice of you to share this. I’m always horrified to see companies invest money developing NN accelerators, only to waste all the hardware gains in layer upon layer of Python code that constantly copies and duplicates data, until the total wasted time exceeds the hardware’s compute time and the whole solution becomes worthless.

Thanks, this is very cool!

I noticed in your code you list the data types that NOE supports:

static const char* data_type_to_string(noe_data_type_t dt) {
    switch(dt) {
        case NOE_DATA_TYPE_NONE:  return "NONE";
        case NOE_DATA_TYPE_BOOL:  return "BOOL";
        case NOE_DATA_TYPE_U8:    return "U8";
        case NOE_DATA_TYPE_S8:    return "S8";
        case NOE_DATA_TYPE_U16:   return "U16";
        case NOE_DATA_TYPE_S16:   return "S16";
        case NOE_DATA_TYPE_U32:   return "U32";
        case NOE_DATA_TYPE_S32:   return "S32";
        case NOE_DATA_TYPE_U64:   return "U64";
        case NOE_DATA_TYPE_S64:   return "S64";
        case NOE_DATA_TYPE_F16:   return "F16";
        case NOE_DATA_TYPE_F32:   return "F32";
        case NOE_DATA_TYPE_F64:   return "F64";
        case NOE_DATA_TYPE_BF16:  return "BF16";
        default:                  return "UNKNOWN";
    }
}
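As an aside, those type names also imply element sizes, which is handy when sizing host-side buffers. This mapping is inferred from the names alone rather than taken from the NOE headers, so treat it as a sketch:

```cpp
#include <cstddef>
#include <string>

// Element size in bytes for each NOE data type name, e.g. for
// allocating host-side buffers. Inferred from the type names;
// confirm against the NOE headers before relying on it.
std::size_t element_size_bytes(const std::string& name) {
    if (name == "BOOL" || name == "U8"  || name == "S8")  return 1;
    if (name == "U16"  || name == "S16" ||
        name == "F16"  || name == "BF16")                 return 2;
    if (name == "U32"  || name == "S32" || name == "F32") return 4;
    if (name == "U64"  || name == "S64" || name == "F64") return 8;
    return 0; // NONE / unknown
}
```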

Does the F32 support indicate that the NPU might be suitable for use with training models?

I imagine that’s probably far more pain than it’s worth with something like PyTorch (which I think only supports CUDA/ROCm), but I’m just interested in the theoretical capabilities.

NPUs are of no use for training: they are designed for inference (the forward pass) only, whereas training requires both forward and backward passes along with large amounts of memory for intermediate activations.
