How is the Orion considered an AI PC if Ollama can't use the GPU/NPU?

How is the Orion considered an AI PC if Ollama can't use the GPU/NPU? Either I'm doing something wrong or all the TOPS are useless to Ollama,

because it clearly says it's running on CPU.

No, you're not doing anything wrong. For now CIX has not released anything that I'm aware of that allows making use of the NPU. So for now it's just a PC with a pretty decent CPU that delivers PC-like AI performance on the CPU alone, simply because it relies on a 128-bit memory bus at 6000 MT/s. That's not exceptional by today's PC standards, and it cannot reach its full potential due to architectural limitations between the CPUs and the memory controller that the NPU could hopefully help work around. For me it delivers exactly the same memory bandwidth as an AMD 5800X. That's good but not exceptional, considering we've moved two generations forward now and that it's trivial to add more RAM to a PC. And the PC remains significantly faster thanks to a beefier CPU: e.g. with Llama-3.2-1B-Instruct-Q4_0.gguf, I'm getting 362 t/s pp512 and 51.74 t/s tg128 on the AMD vs 221 t/s pp512 and 39.76 t/s tg128 on the O6. And my O6 has its RAM controller overclocked by around 12%.

So for now a second-hand PC remains a better choice, but if the form factor matters, then the O6 is probably the only option left.

There's a thread here showing an example YOLO (object detection) C++ implementation, but the Python SDK/APIs are far from optimal right now (to the point where it's questionable whether they'll really be of much use).

For Ollama etc., I wouldn't get your hopes up: it might be a very long time until Llama.cpp can support the Cix NPU (if ever). Would love to see it though.

Shorter term, if the Python API gets fixed up, we might see an example of the Cix NPU used with the Python Transformers library for LLMs. I'd really like to see this because it'd give some indication as to whether the Cix NPU is actually as capable as it sounds on paper (and it's probably a lower-effort endeavour than the above).
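For what that could look like, here's a minimal sketch of plain CPU-only generation with the Transformers library; the model id is just an example, and nothing here touches the Cix NPU (that part is exactly what's missing and would need a dedicated backend):

```python
# Minimal CPU-only text generation with Hugging Face Transformers.
# The model id is an example; nothing here uses the Cix NPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # example: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # loads on CPU by default

inputs = tokenizer("The Orion O6 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

An NPU-enabled path would presumably replace the plain CPU load with whatever execution backend Cix ends up shipping, which is the part nobody outside the early-access programme has seen yet.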

Cix’s NPU SDK is, unfortunately, still “early access” - you have to explicitly sign up on their website to get granted access to it.

As an aside, does anyone know why using Vulkan with Llama.cpp is currently so slow? Has a particular bottleneck been identified?

So basically their web page is all hype and we fell for it; nothing is really as it seems.

The problem is that there's a lot of hype around AI these days, and all companies are trying to just run a Python-based demo of something that looks promising, except that the Python environment used for AI is horrible garbage that cannot be used for anything beyond demos. It's the only environment that constantly breaks after updates and where the code dependencies take more storage space than the models, not to mention the amazing waste of RAM when loading models, which get duplicated instead of memory-mapped, etc. That's why you generally see such demos as YouTube videos instead of replicable how-tos: let's quickly take a snapshot while it happens to work... But that's what vendors use to showcase their products :-(
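On the duplicated-vs-mapped point, here's a small illustrative sketch of the difference, using torch.load's mmap flag (available in recent PyTorch releases); the file name is made up:

```python
# Illustrative sketch: eager vs memory-mapped checkpoint loading.
# "model.pt" is a made-up path; mmap=True needs a reasonably recent PyTorch.
import torch

# Eager load: every tensor is copied into freshly allocated process memory,
# on top of the copy the kernel already holds in the page cache.
state = torch.load("model.pt", map_location="cpu")

# Memory-mapped load: tensors stay backed by the file, so pages are shared
# with the page cache instead of being duplicated for each process.
state_mapped = torch.load("model.pt", map_location="cpu", mmap=True)
```

Most quick demos do the first, which is where a lot of that RAM waste comes from.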

Thus for now, every time I read "NPU" on a product, I think "let's hope it doesn't draw power when not used", since I know for certain it will never be usable in that product; all of that is only marketing. It will change once products start to standardize around common ways to talk to such accelerators, like what happened with OpenCL or Vulkan for GPUs, but it seems it's still too early.

In the case of the O6, one hope could be to offload some layers to the GPU, but when I tried this it just slowed things down compared to CPU-only (even with a single layer), because the GPU only has 10 cores. My understanding is that running AI on a GPU requires a huge number of cores, since each of them is much slower than a CPU core, but when you combine hundreds to thousands of them you benefit from the parallelism. Also, I think that when using both GPU and CPU, the model in RAM often cannot be shared between the two, which means twice the amount of memory is needed.
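If anyone wants to reproduce that offload experiment, a minimal sketch with the llama-cpp-python bindings looks like this (assuming they were built with a GPU backend such as Vulkan; the model path is illustrative):

```python
# Sketch of partial GPU offload with llama-cpp-python.
# Assumes the bindings were built against a GPU backend (e.g. Vulkan);
# the model path is illustrative.
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers go to the GPU:
# 0 keeps everything on the CPU, -1 offloads all of them.
llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_0.gguf", n_gpu_layers=1)

out = llm("The quick brown fox", max_tokens=32)
print(out["choices"][0]["text"])
```

On the O6 I'd expect the same result as with the native llama.cpp binary: even a single offloaded layer ends up slower than keeping everything on the CPU.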

So for now we're limited to CPU-only processing. Fortunately the O6 has good CPUs with reasonable DRAM bandwidth, even if, as I mentioned, a $100 second-hand 5-year-old Ryzen beats it, though not in this form factor, since it requires a huge heat sink and fan. For that reason, at home the O6 remains my machine of choice for LLMs, just because I don't want to add a Ryzen there (Ryzens are not good for always-on machines; they draw a lot of power at idle). At work I'm using an Ampere Altra that starts to compete with some entry-level GPUs:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | CPU        |      80 |           pp512 |       1774.83 ± 1.73 |
| llama 1B Q4_0                  | 729.75 MiB |     1.24 B | CPU        |      80 |           tg128 |        110.09 ± 0.04 |

This machine has 6 DDR4-2933 channels (140 GB/s theoretical, 131 measured), hence 1.4x the O6's theoretical bandwidth but 3x its real bandwidth, plus 80 A76 cores helping with prompt processing, which highlights the importance of high CPU performance for PP.
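For those who want to sanity-check the bandwidth numbers, the usual peak estimate is just channels × bus width × transfer rate; a quick sketch, taking the O6's 128-bit bus at 6000 MT/s mentioned above:

```python
# Peak DRAM bandwidth estimate: channels * bytes per channel * MT/s.
def peak_gb_s(channels: int, bytes_per_channel: int, mt_s: int) -> float:
    return channels * bytes_per_channel * mt_s / 1000.0  # GB/s

print(peak_gb_s(6, 8, 2933))  # Altra: 6x 64-bit DDR4-2933      -> ~140.8 GB/s
print(peak_gb_s(2, 8, 6000))  # O6: 128-bit bus at 6000 MT/s    -> ~96.0 GB/s
```

That's where the roughly 1.4x theoretical gap comes from; the 3x real gap is down to how much of that peak each memory subsystem actually sustains.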

In any case, even if the NPU one day becomes usable, we could hope for higher t/s on pp512, but not much more for tg: probably twice at best, by saturating the memory bandwidth (we don't even know whether the NPU alone can use all of it; we only know it's not limited by the too-narrow DSU-CI interconnect).
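To put a rough number on the tg side: on a dense model, generating each token means streaming essentially the whole weight file from DRAM, so tokens/s is bounded by bandwidth divided by model size. A back-of-the-envelope sketch, assuming the ~44 GB/s effective bandwidth implied by the 3x figure above:

```python
# Rough ceiling for token generation on a dense model:
# each generated token streams (roughly) all weights from DRAM once.
model_gb = 729.75 / 1024       # llama 1B Q4_0 from the table above, ~0.71 GB
eff_bw_gb_s = 131 / 3          # O6 effective bandwidth implied by the 3x figure
print(eff_bw_gb_s / model_gb)  # ~61 t/s ceiling, vs ~40 t/s measured on the O6
```

The measured 39.76 t/s is already a fair fraction of that ceiling, which is consistent with expecting only a modest tg gain from the NPU.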

I do want to insist that the board is a great one anyway. I bought mine after running my tests on the first one. It’s just below its full potential but is already decent depending on what you want to do with it. Of course if you already have an always-on PC with more powerful CPUs and modern DRAM that beats it on all fronts, you could be disappointed, but that’s probably not the case for everyone.

Agree broadly with this.

I'll also add that I'm pretty sure newer Ryzens with AI processors do a similar thing: the hardware is there, but the software compatibility still is not. E.g. I don't think Llama.cpp is capable of utilizing Ryzen NPUs yet. (Unsure of the quality of the Ryzen NPU libraries, haven't looked into that. Maybe it's possible to use their NPU with LLMs via Python and Transformers?)

So, while Cix/Radxa are perhaps guilty of over-hype, they’re not the only ones. The RK3588 NPU drivers are also closed-source (and that chip has been around for a few years now).

I do hope we'll get a quality release of the Orion NPU libs soon, though, and that it doesn't become abandonware. I'll admit I'd feel pretty cheated if that happens, as the NPU was a big part of the marketing hype.

I've got a lot of Nvidia embedded Xaviers and Orins here; AI runs great on them all, very high-power systems.

All the tests I've seen on Ryzen AI chips were done using the embedded GPU instead of the NPU. And it's not very pretty: better than pure CPU but worse than entry-level GPUs, in addition to coming on boards with soldered RAM, where you'd rather know upfront how much you will need before buying. So I gave up on this front.

Second-hand server CPUs with 4-12 DRAM channels are reasonably cheap. At work we found a 64-core EPYC for $600, for example, and 8-12 core Xeons with 4-8 channels of DDR5 can be found for around $250 to install on a $500 board. 8 channels of DDR5-4800 top out around 300 GB/s (more likely 240 real). That's much more than frozen designs with 2-4 channels of soldered RAM.

lol, that annoys me too!

I hate having to install 6-8 GB of Python garbage for each project/program in virtual environments to avoid dependency conflicts and to satisfy required Python versions.

An OSS driver has now been queued for the 6.18 Linux kernel release. I tested it using the instructions here and it seems to work. Of course, it currently requires some elbow grease, like doing your own kernel and Mesa 3D builds.

I don't want to complain about any OSS effort; however, it would be wrong to see it as a replacement for the proprietary driver, as it only supports a couple of MobileNet models, which are long outdated.
