Orion O6 Debug Party Invitation

One other question, I’m seeing around 14W idle power draw using an Apple 61W USB-C PD wall charger, with an NVMe SSD, HDMI active, and a single Ethernet cable plugged in (2.5 Gbps). Does that match up to expectations? Too high? Too low?

Just wanted to compare to anyone else measuring total system power draw.

Nope. That’s nuts.

Let’s wait for China to finish its celebrations.

Indeed. Also if running HPL across all 12 CPU cores, I get an abysmal 36 Gflops at 36W… that’s in like RISC-V territory in terms of efficiency :stuck_out_tongue:

Re-testing with 8 cores, but I think the default scheduling has something funky going on (and judging by a read-through of this issue, that seems to be what you’ve indicated as well).

I just tried shutting down my O6, with sudo shutdown now, but after shutting down, it booted right back up. It seems like pulling power is the only way to shut this down?

Maybe if I set AC Power Loss restore to not boot up the O6, it will work the way I expect?

Hello from Windows 11 :upside_down_face:

This is an older image of build 22621 I had lying around. Latest insider build 27783 only boots with a single core enabled, otherwise it bugchecks with PHASE0_EXCEPTION. Might investigate at some point.

Onboard Ethernet works with the driver here: coolstar/if_re-win: Open Source Realtek Ethernet Windows driver (Prod signed build: if_re_win_1.0.3.zip (163.7 KB)) - tested up to 1 Gbps.

The PCIe integration seems really well done, with compliant ECAM, cache coherence, and ITS-based MSI(-X). As a result, things like NVMe are usable out of the box.

Some early measurements:

Geekbench 6.4.0
https://browser.geekbench.com/v6/cpu/10224768 (clocks from screenshot above)
https://browser.geekbench.com/v6/cpu/10223515 (1.8 GHz default for all clusters – ACPI CPPC doesn’t work yet so I had to raise the clocks manually)

Cinebench 2024.1.0

(Cinebench 2024 result screenshots)

Note that newer builds have more Arm optimizations (e.g. ARMv8.1 atomics) and in theory could yield better results.


I think I have the start of an explanation for the slightly lower perf on CPU0:

$ watch -n 1 -d 'grep hda /proc/interrupts'
Every 1.0s: grep hda /proc/interrupts                                                                                                                                                      orion-o6: Sat Feb  1 21:17:58 2025

160:    6811576          0          0          0          0          0          0          0          0          0          0          0     PDCv1 234 Level     hda irq

This “hda” interrupt fires around 50k times per second on core 0. I’m now trying to figure out what causes it. Unloading all snd_* modules crashed the system, so I’ll need to blacklist them instead.

Edit: that was it. Here’s the new c2clat output after blacklisting all snd_* modules:
(c2clat plot after blacklisting the snd_* modules)

So now I think that core 0 is exactly the same as core 11, and once that driver is removed, there’s no reason not to use it for measurements.
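For reference, the rate can be measured directly rather than eyeballing watch deltas. A minimal sketch (the helper name is made up, and the field layout assumes the standard /proc/interrupts format, where column 2 is the CPU0 count and the description ends in “hda irq” as shown above):

```shell
# Count CPU0 interrupts for a named IRQ ("hda" here), sampled twice
# one second apart; prints 0 if the name is not present.
irq_cpu0() { awk -v n="$1" '$0 ~ n" irq" {c=$2} END {print c+0}' /proc/interrupts; }
a=$(irq_cpu0 hda); sleep 1; b=$(irq_cpu0 hda)
echo "hda: $(( b - a )) interrupts/s on CPU0"
```

On the O6 with the audio driver loaded, this should print a number in the tens of thousands.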


Jeff, I’m measuring the consumption with a USB adapter (which I had already verified as accurate). I’m getting 10.3-10.5W at idle (10.3 when cix_audio_switch is killed and the snd_* modules removed). The RJ45 is connected at GbE, and I have a small 128G 2242 SSD. No HDMI. The highest I got was 24.3W when building haproxy on all cores, then 18.75W with “openssl speed -multi 12 rsa2048” on all cores, and 18.5W with multi 8 on the A720 (thus the A520 are quite cheap). It jumps from 10.3 to 14.2W when reading the SSD at 1.18GB/s. I’m seeing 22.6W under rambw. Min and max below:


I also tried a Debian testing live ISO with the 6.10 kernel: https://cdimage.debian.org/cdimage/weekly-live-builds/arm64/iso-hybrid/

Debian’s 6.12 kernel in trixie or sid repo is not bootable.

The ISO was built in August 2024, so the system boots with an old date; I had to set the date/time manually to make apt work. I also had to modify the grub install command used by Calamares (under /usr/lib) to make the installer run successfully.


I do wonder about Windows not properly detecting CPU caches on that version…

Also, if the L3 cache is shared by all cores, benchmarks MIGHT score higher with the A520 cores disabled. I feel like those cores are just a handbrake and a waste of L3 cache, especially on Windows.

Core #10 was supposed to be core #0 and the MPIDRs reflect this. I wonder why they haven’t moved the whole cluster at least instead of placing a single A720 core before the A520 cluster.

Ah! I just discovered the same thing by looking at the DTB and at boot messages. In the DTB, we have:

  • 4 A520 little cores
  • 4 A720 medium cores
  • 4 A720 big cores

But the kernel boots off CPU #10, which gets assigned logical CPU id 0. From there, all secondary CPUs are enumerated one at a time, explaining why we’re seeing the A520 as CPUs 1 to 4, etc. Then physical CPU 10 is skipped since it’s already in use, and enumeration finishes on CPU 11, which is the only one in its correct place. This can be verified like this:

$ dmesg|grep '\(CPU\|processor\) 0x0000000'|cut -f2- -d','
cpu0,swapper]Booting Linux on physical CPU 0x0000000a00 [0x410fd811]
cpu1,swapper/1]CPU1: Booted secondary processor 0x0000000000 [0x410fd801]
cpu2,swapper/2]CPU2: Booted secondary processor 0x0000000100 [0x410fd801]
cpu3,swapper/3]CPU3: Booted secondary processor 0x0000000200 [0x410fd801]
cpu4,swapper/4]CPU4: Booted secondary processor 0x0000000300 [0x410fd801]
cpu5,swapper/5]CPU5: Booted secondary processor 0x0000000400 [0x410fd811]
cpu6,swapper/6]CPU6: Booted secondary processor 0x0000000500 [0x410fd811]
cpu7,swapper/7]CPU7: Booted secondary processor 0x0000000600 [0x410fd811]
cpu8,swapper/8]CPU8: Booted secondary processor 0x0000000700 [0x410fd811]
cpu9,swapper/9]CPU9: Booted secondary processor 0x0000000800 [0x410fd811]
cpu10,swapper/10]CPU10: Booted secondary processor 0x0000000900 [0x410fd811]
cpu11,swapper/11]CPU11: Booted secondary processor 0x0000000b00 [0x410fd811]

See these 0xa00 then 0x000, 0x100 … 0x900, then 0xb00. I think they should make the cores appear from biggest to smallest, and then they could boot off CPU0 without requiring any renumbering. I’ll experiment with the DTB to see if it could be sufficient to rearrange them there, though I doubt it.
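For what it’s worth, those affinity values can be decoded with plain shell arithmetic. This assumes Aff1 (bits 15:8 of the MPIDR) carries a flat core index with 4 cores per cluster, which matches both the DTB layout and the dmesg output above:

```shell
# Decode an MPIDR affinity value into (cluster, core), assuming 4
# cores per cluster and the core index carried in bits 15:8.
decode() { idx=$(( $1 >> 8 )); echo "MPIDR $1 -> cluster $(( idx / 4 )), core $(( idx % 4 ))"; }
decode 0xa00   # boot CPU: cluster 2, core 2
decode 0xb00   # last CPU: cluster 2, core 3
```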


I’ve tried rearranging them in ACPI with Windows and it didn’t work. PSCI in TF-A relies on the current numbering…

Same here with the DTB. As soon as I change the order of the CPUs declared in the DTB (just the declaration order, nothing else), it reboots during early boot. I also tried to adjust the core numbers/names etc. in case it would matter, and I tried to adjust the “cpumap” section (it has no effect). There’s also a “dsu” part that enumerates CPUs, and even some cache_exception_core* entries, but that didn’t fix it. It’s possible that the CPU numbers are hard-coded in some drivers that are not happy to see them changed.

Why would you want to reorder or renumber the CPUs? Is it for technical reasons?


To make it less of a pain to figure out and assemble clusters. Usually on big-little (and generally on many-core systems) you use “taskset” with everything to assign tasks to the preferred cluster. Same for IRQs, which are often assigned using simple bit rotation written in shell in a “for” loop. Having CPUs in random order makes it super complicated to perform manual bindings. Here’s what we currently have:

  • cpu0: core 2 of cluster 2
  • cpu1: core 0 of cluster 0
  • cpu2: core 1 of cluster 0
  • cpu3: core 2 of cluster 0
  • cpu4: core 3 of cluster 0
  • cpu5: core 0 of cluster 1
  • cpu6: core 1 of cluster 1
  • cpu7: core 2 of cluster 1
  • cpu8: core 3 of cluster 1
  • cpu9: core 0 of cluster 2
  • cpu10: core 1 of cluster 2
  • cpu11: core 3 of cluster 2

For manual handling it’s a real pain. I’m currently binding processes using “taskset -c 0,9-11” and “taskset -c 5-8”, neither of which is usual or natural. Even looking at “top” is hard to follow in real time.
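To illustrate the bit-rotation idiom mentioned above (the IRQ numbers are examples, and the commands are printed rather than executed, so this is only a sketch):

```shell
# Classic IRQ-spreading loop: rotate a one-bit affinity mask across
# all CPUs. With the O6's current numbering, consecutive IRQs land on
# cores from different clusters, which is exactly the problem.
ncpu=12; cpu=0
for irq in 160 161 162 163; do
  printf 'echo %x > /proc/irq/%d/smp_affinity\n' $(( 1 << cpu )) "$irq"
  cpu=$(( (cpu + 1) % ncpu ))
done
```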

Ideally, clusters would be arranged from biggest to smallest, so that the first 4 CPUs are the biggest cores, the next 4 the middle cores, and the final 4 the A520 (as drawn on the SoC diagram). That would make CPU bindings the most agnostic/portable while using the maximum performance (using only 4 cores means 0-3, and using 8 means 0-7, still quite performant). But the reverse approach (0-3 = A520, 4-7 = middle, 8-11 = big) also works; it just requires remembering to use 8-11 for the 4 fastest cores and 4-11 for 8.

I can understand why the BIOS boots from one of the biggest CPUs to minimize boot time during decompression (though I’m pretty much convinced the difference is invisible), and it turns out that this chosen CPU becomes CPU0. Thus I think it’s fine if CPUs are arranged from biggest to smallest.

I tried to modify the DTS to have 10,11,8,9,4,5,6,7,0,1,2,3, but I noticed that while swapping one core with another of the same cluster is OK (e.g. I can swap 4 and 5), any inversion across clusters causes a reboot (e.g. swapping just 3 and 4 crashes).

Hoping this helps!


Just as a sidenote: in reality, all ARM SoC vendors (except Amlogic with their A311D2 and S928X) do it the other way around (small to big, so cpu0 always ends up being the slowest core possible). And now Cix has added some creativity to the mix.

Aside from that, I fully support your reasoning :slight_smile:


Actually I think they most often start with the most numerous cores and end with the least numerous ones, and since the vast majority of SoCs in the ARM world focus on cost cutting, you end up with a ton of useless little cores that serve marketing, advertising “6 cores!” when only 2 are usable for applications. In the PC world it’s the exact opposite: because applications try to bind to the first cores, vendors put the fastest ones first and end with the slow ones.

In any case, while I do find it more convenient to start from 0 and pick as many cores as you need to get the best perf, I can also accommodate the opposite (what I’d call the “rockchip way”, with little before big), as long as cores are grouped correctly and clusters are monotonically ordered so that a single CPU range can cover all big cores (i.e. no A520 in the middle). In the CIX case, the CPU numbering is visible and starts with little then big; at the least that order should be respected if possible, and if not (or if there’s a good reason not to), it should be reversed.

Well, at least with Meteor Lake and the top SKUs, Intel is doing something similar to Cix: on an Ultra 9 185H, cpu0 is not the fastest core but comes from the 2nd fastest cluster: 6 P-cores with HT enabled, two of them allowed to clock up to 5.1GHz, four to 4.8GHz, and cpu0 moved out of the 2nd cluster “to the top”. But I guess that’s nitpicking and in general you’re right wrt x64.

Apple on the other hand follows the ‘ARM tradition’ with all efficiency cores forming the first cluster.


A few tests with LLMs under llama.cpp show good results:

  • deepseek-r1-qwen-14B-IQ4_NL, 8 big cores:
    $ taskset -c 0,5-11 ./build/bin/llama-cli -t 8 -m models/DeepSeek-R1-Distill-Qwen-14B-IQ4_NL.gguf -n 100 -p "Explain to a computer engineer the main differences between ARMv8 and ARMv9" -no-cnv:
    • prompt eval: 15.18 t/s
    • text gen: 4.57 t/s
  • deepseek-r1-qwen-14B-IQ4_NL, 4 biggest cores:
    $ taskset -c 0,9-11 ./build/bin/llama-cli -t 4 -m models/DeepSeek-R1-Distill-Qwen-14B-IQ4_NL.gguf -n 100 -p "Explain to a computer engineer the main differences between ARMv8 and ARMv9" -no-cnv:
    • prompt eval: 9.08 t/s
    • text gen: 4.38 t/s
  • llama-3.1-8b-IQ4_XS, 8 cores:
    • prompt eval: 17.25 t/s
    • text gen: 9.12 t/s
  • llama-3.1-8b-Q8_0, 8 cores:
    • prompt eval: 15.66 t/s
    • text gen: 4.89 t/s
  • ministral-3B-Q5_K_M, 8 cores:
    • prompt eval: 24.20 t/s
    • text gen: 16.48 t/s
  • phi-3.1-mini-IQ4_XS (3B, 128k ctx):
    • prompt eval: 32.97 t/s
    • text gen: 18.42 t/s
  • mistral-nemo-minitron-8B-IQ4_XS:
    • prompt eval: 16.33 t/s
    • text gen: 8.81 t/s

As usual, text generation is memory-bound while prompt processing is more CPU-bound. But the results are very good, especially for the 14B and 8B models in IQ4 quantization.
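As a sanity check of the memory-bound claim, a back-of-envelope calculation: token rate is roughly effective bandwidth divided by the bytes read per token (about the model file size for dense models). Both numbers below are assumptions, not measurements:

```shell
# Predicted generation ceiling for llama-3.1-8b IQ4_XS, assuming
# ~40 GB/s effective bandwidth and a ~4.3 GB model file.
awk 'BEGIN {
  bw = 40        # assumed effective memory bandwidth, GB/s
  model = 4.3    # approximate IQ4_XS file size, GB
  printf "predicted ceiling: %.1f tokens/s\n", bw / model
}'
```

That lands close to the 9.12 t/s measured above, consistent with generation being bandwidth-limited.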

However, llama.cpp is a pain to build on this distro due to the cmake abomination that insists on injecting CPU feature flags the compiler doesn’t support and ignores some of the build options for subsystems. In addition, when gcc-12 builds for “native”, it disables optimizations like SVE and dotprod, which it doesn’t seem to recognize. I got bored of trying to fix this after an hour; in the end it was easier to install gcc-14 from a more recent release and build natively.
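One way to see what the compiler will actually enable is to preprocess a probe built from the standard ACLE feature macros (the macros are real; the probe file and the example compile command are mine):

```shell
# Write a tiny probe; preprocessing it with the real build flags, e.g.
#   gcc -mcpu=native -E probe.c
# shows which features survive (lines left after #ifdef filtering).
cat > probe.c <<'EOF'
#ifdef __ARM_FEATURE_DOTPROD
have_dotprod
#endif
#ifdef __ARM_FEATURE_SVE
have_sve
#endif
EOF
wc -l probe.c
```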

The latest Windows build is currently crashing on the P1 and we believe it might be related to this.

Reordering would also help with VMware ESXi, which does not support non-uniform CPUs and consequently also crashes here. The little cluster could be moved to the last position and disabled.