I’ve tried rearranging them in ACPI with Windows and it didn’t work. PSCI in TF-A relies on the current numbering…
Orion O6 Debug Party Invitation
Same here with the DTB. As soon as I change the order of the CPUs declared in the DTB, and just the declaration order, nothing else, it reboots during early boot. I also tried to adjust the core numbers/names etc in case it would matter. I’ve tried to adjust the “cpumap” section (has no effect). There’s also a “dsu” part that enumerates CPUs and even some cache_exception_core* but that didn’t fix it. It’s possible that the CPU numbers are hard-coded in some drivers that are not happy to see them changed.
Why would you want to reorder or renumber the CPUs? Is it for technical reasons?
To make it less of a pain to figure out and assemble clusters. Usually on big.LITTLE (and generally on many-core systems) you use “taskset” with everything to assign tasks to the preferred cluster. Same for IRQs, which are often assigned using simple bit rotation written in shell in a “for” loop. Having CPUs in random order makes it super complicated to perform manual bindings. Here’s what we currently have:
- cpu0: core 2 of cluster 2
- cpu1: core 0 of cluster 0
- cpu2: core 1 of cluster 0
- cpu3: core 2 of cluster 0
- cpu4: core 3 of cluster 0
- cpu5: core 0 of cluster 1
- cpu6: core 1 of cluster 1
- cpu7: core 2 of cluster 1
- cpu8: core 3 of cluster 1
- cpu9: core 0 of cluster 2
- cpu10: core 1 of cluster 2
- cpu11: core 3 of cluster 2
For manual handling it’s a real pain. I’m currently binding processes using “taskset -c 0,9-11” and “taskset -c 5-8”, neither of which is usual nor natural. Even following “top” in real time is hard.
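For reference, the bit-rotation IRQ loop mentioned above usually looks like the sketch below. The IRQ numbers are made up for illustration, and it only prints the masks instead of writing them to /proc:

```shell
#!/bin/sh
# Sketch of the usual "for" loop that rotates a one-bit affinity mask
# across the 12 CPUs; the IRQ numbers here are hypothetical examples.
ncpu=12
i=0
for irq in 120 121 122 123; do
    mask=$(printf '%x' $((1 << (i % ncpu))))
    # real usage would be: echo $mask > /proc/irq/$irq/smp_affinity
    echo "irq $irq -> smp_affinity $mask"
    i=$((i + 1))
done
```

With the CPUs scattered as above, this simple rotation lands consecutive IRQs on cores from different clusters, which is exactly the problem.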
Ideally, clusters would be arranged from biggest to smallest so that the first 4 CPUs would be the biggest cores, the next 4 the middle cores, and the final 4 the A520s (as drawn on the SoC diagram). That’s what would make CPU bindings the most agnostic/portable while using the maximum performance (e.g. using only 4 cores just means 0-3, and using 8 means 0-7, still quite performant). But the reverse approach (0-3=A520, 4-7=middle, 8-11=big) also works; it just requires remembering to use 8-11 for the 4 fastest cores and 4-11 for 8.
I can understand why the BIOS boots from one of the biggest CPUs to minimize boot time during decompression (though I’m pretty much convinced the difference is invisible), and it turns out that this chosen CPU becomes CPU0. Thus I think it’s fine if CPUs are arranged from biggest to smallest.
I tried to modify the DTS to have 10,11,8,9,4,5,6,7,0,1,2,3, but I noticed that while swapping one core with another of the same cluster is OK (e.g. I can swap 4 and 5), any inversion across clusters causes a reboot (e.g. swapping just 3 and 4 crashes).
Hoping this helps!
Just as a sidenote: in reality all ARM SoC vendors (except Amlogic with their A311D2 and S928X) do it the other way around (small to big, so cpu0 always ends up being the slowest core possible). And now Cix has added some creativity to the mix.
Aside from that, I fully support your reasoning.
Actually I think they most often start with the most numerous cores and end with the least numerous ones, and since the vast majority of SoCs in the ARM world focus on cost cutting, you end up with a ton of useless little cores that serve marketing to advertise “6 cores!” when only 2 are usable for applications. If you look at the PC world, it’s the exact opposite: because applications try to bind to the first cores, vendors put the fastest ones first and end with the slow ones.
In any case, while I do find it more convenient to start from 0 and pick as many cores as you need to get the best perf, I can also accommodate the opposite (what I’d call the “rockchip way”, with little before big), as long as they’re all grouped correctly and clusters are monotonically ordered so that a single CPU range can cover all big cores (i.e. not the A520 in the middle). And in the CIX case, the CPU numbering is visible and starts with little then big. At least it should be respected if possible, and if not (or if there’s a good reason not to), it should be reversed.
Well, at least with Meteor Lake and the top SKUs, Intel is doing something similar to Cix: on an Ultra 9 185H, cpu0 is not the fastest core but comes from the 2nd fastest cluster: 6 P-cores with HT enabled, two of them allowed to clock up to 5.1GHz, four to 4.8GHz, and cpu0 moved out of the 2nd cluster ‘to the top’. But I guess that’s nitpicking, and in general you’re right wrt x64.
Apple on the other hand follows the ‘ARM tradition’ with all efficiency cores forming the first cluster.
A few tests with LLMs under llama.cpp show good results:
- deepseek-r1-qwen-14B-IQ4_NL, 8 big cores:
$ taskset -c 0,5-11 ./build/bin/llama-cli -t 8 -m models/DeepSeek-R1-Distill-Qwen-14B-IQ4_NL.gguf -n 100 -p "Explain to a computer engineer the main differences between ARMv8 and ARMv9" -no-cnv
- prompt eval: 15.18 t/s
- text gen: 4.57 t/s
- deepseek-r1-qwen-14B-IQ4_NL, 4 biggest cores:
$ taskset -c 0,9-11 ./build/bin/llama-cli -t 4 -m models/DeepSeek-R1-Distill-Qwen-14B-IQ4_NL.gguf -n 100 -p "Explain to a computer engineer the main differences between ARMv8 and ARMv9" -no-cnv
- prompt eval: 9.08 t/s
- text gen: 4.38 t/s
- llama-3.1-8b-IQ4_XS, 8 cores:
- prompt eval: 17.25 t/s
- text gen: 9.12 t/s
- llama-3.1-8b-Q8_0, 8 cores:
- prompt eval: 15.66 t/s
- text gen: 4.89 t/s
- ministral-3B-Q5_K_M, 8 cores:
- prompt eval: 24.20 t/s
- text gen: 16.48 t/s
- phi-3.1-mini-IQ4_XS (3B, 128k ctx):
- prompt eval: 32.97 t/s
- text gen: 18.42 t/s
- mistral-nemo-minitron-8B-IQ4_XS:
- prompt eval: 16.33 t/s
- text gen: 8.81 t/s
As usual the text gen is memory-bound and the prompt processing is more CPU bound. But the results are very good, especially for the 14B and 8B models in IQ4 quantization.
However, llama.cpp is a pain to build on this distro due to the cmake abomination that insists on injecting CPU feature flags that are not supported by the compiler, and that ignores some of the build options for subsystems. In addition, when gcc-12 builds for “native”, it disables all optimizations like SVE and dotprod, which it doesn’t seem to recognize. I got bored of trying to fix this after an hour; in the end it was easier to install gcc-14 from a more recent distro release and build natively.
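For anyone hitting the same wall, a quick way to see which -march strings a given gcc actually accepts is to preprocess an empty file with them. The arch strings below are just examples of the kind of extension flags involved; adjust them to whatever your build injects:

```shell
# Probe whether the compiler accepts a -march string by preprocessing an
# empty input; exit status tells us acceptance. Arch strings are examples.
for arch in armv8.2-a+dotprod armv9-a+sve2; do
    if gcc -march="$arch" -E -x c /dev/null >/dev/null 2>&1; then
        echo "$arch: accepted"
    else
        echo "$arch: rejected"
    fi
done
```

Running this with gcc-12 vs gcc-14 quickly shows which extension names the older compiler simply doesn’t know.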
The latest Windows build is currently crashing on the P1 and we believe it might be related to this.
Reordering would also help with VMware ESXi, which does not support non-uniform CPUs and consequently also crashes here. The little cluster could be moved to the last position and disabled.
Is “cpuUniformityHardCheckPanic=FALSE” not supported on the ARM version of ESXi?
Haven’t tried that yet.
See also: https://github.com/geerlingguy/ollama-benchmark/issues/13
It is a little faster than N150 in some areas, a little slower in others. The performance on this chip so far has been puzzling to me.
Do you know the quantization that was used in your case? The deepseek-14b is quite different between our two tests (2.63 t/s for you, 4.38 for me at IQ4_NL). Or maybe you left the small cores enabled? These should absolutely be avoided or they’ll slow everything down, and the other cores will spend their time spinning in locks waiting for the small ones to finish. That’s why I used “taskset -c 0,5-11” on the command.
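If in doubt, it’s easy to double-check that the affinity mask really applies to the launched process, since the kernel reports it in /proc/<pid>/status (shown here pinning to cpu0 only):

```shell
# Verify that taskset's CPU list is inherited by the child process:
# Cpus_allowed_list in /proc/self/status should show exactly "0" here.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
```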
Running a few PCIe compatibility tests with various boards. I can confirm that a dual-port Myricom 10G NIC is correctly seen as Gen2 x8:
Capabilities: [40] Express (v2) Upstream Port, MSI 00
DevCap: MaxPayload 2048 bytes, PhantFunc 0
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <4us, L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: Routing-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: EgressBlck-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
An AMD RX560 graphics card negotiates Gen3 x8:
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
A StarTech USB 3.2 adapter, however, makes the early init code hang:
E/TC:10 00 tee_otp_get_hw_unique_key:123 Get hw key:
I/TC: Primary CPU switching to normal world boot
INFO: BL31: Preparing for EL3 exit to normal world
INFO: Entry point address = 0x84400000
INFO: SPSR = 0x3c9
[44.829] [UEFI] E3C1 XspiInitDxeStart
[44.830] [UEFI] E400 XspiInitDxeEnd
[44.918] [UEFI] E2C1 PcieInitDxeStart
I don’t have the PCI ID but the model is “PEXUSB312A3”.
I noticed that the bottom plate is slightly too large for PCIe cards, it should be shrunk by approx 1.5mm on the back side because it hinders the extension cards’ metallic plate. I had to slightly bend it on all cards for them to fit:
I understand that it’s probably not the primary target, but since that enclosure is really well suited for development and only requires removing the top cover to install a card, it would be great if cards fitted more naturally.
Speaking of the built-in display controller, it seems that it uses the Armchina Linlon D6 DPU IP, which appears to be an upgraded version of Arm’s earlier Mali D71/D51 DPU. There’s a mainline Linux driver for the Mali D71 (Komeda) - I’m wondering how far one would get by trying to bind that driver on the Orion P6?
ARM SoCs used in phones and tablets and SBCs put the smallest cores first for power-saving reasons; cpu0 hotplug is not supported* and almost every platform has at least one interrupt etc. that would prevent you from taking cpu0 offline even if it was. There are a number of kernel tasks that like to stick to cpu0 as well, so in practice cpu0 must always be online and will fairly regularly wake up to do Stuff in response to interrupts and timers and the like.
As a result, anything that wants to run Linux/Android in a power-efficient fashion needs cpu0 to be a low-power core, and essentially every ARM SoC is laid out that way. (In other words, blame Android.)
However, for the best experience when running Windows you want cpu0 to be a high-power core because Windows has a lot of tasks that essentially have to run on the first logical CPU, so you’ll find in x86_64 land core 0 is usually a big/“P” core. I suspect that’s why CIX have done what they’ve done with the logical CPU numbers here - putting one of the A720s as the “first” core will provide a markedly better experience on Windows.
* there used to be support for cpu0 hotplug on x86_64, but it was removed in kernel 6.5 and never actually worked to begin with
That’s actually an excellent reason I hadn’t thought about, indeed!
Great explanation, which also hints at why the two known exceptions prior to Cix (Amlogic A311D2 and S928X) didn’t have to care that much about energy efficiency: they were designed for TV boxes.
Can someone who owns the board print the system register ID_AA64PFR0_EL1 from within the Linux kernel? This will tell us what specific CPU features the new Armv9 cores implement.
Here is a skeleton showing how to do that.
Many thanks
Here it comes, both for A720 and A520:
[25541.164579] [2025:02:05 03:58:53][pid:12224,cpu11,insmod]ID_AA64PFR0_EL1: 0x1201111123111111
[25548.875467] [2025:02:05 03:59:01][pid:12229,cpu11,rmmod]Module unloaded.
[25594.795421] [2025:02:05 03:59:47][pid:12234,cpu1,insmod]ID_AA64PFR0_EL1: 0x1201111123111111
[25600.151633] [2025:02:05 03:59:52][pid:12240,cpu11,rmmod]Module unloaded.
Hoping this helps.
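For anyone who wants to eyeball that value without the Arm ARM open, the register is just sixteen 4-bit fields, so a shell loop can split it out. The field names below follow the ID_AA64PFR0_EL1 layout as I understand it; the decode itself is plain nibble extraction:

```shell
# Split ID_AA64PFR0_EL1 (value from the dmesg above) into its 4-bit
# fields, low bits first. Field names follow the Arm ARM layout.
val=$((0x1201111123111111))
names="EL0 EL1 EL2 EL3 FP AdvSIMD GIC RAS SVE SEL2 MPAM AMU DIT RME CSV2 CSV3"
bit=0
for name in $names; do
    printf '%-8s (bits %2d-%2d): %x\n' "$name" "$bit" $((bit + 3)) $(( (val >> bit) & 0xf ))
    bit=$((bit + 4))
done
```

Notably, this value reports SVE=1 and GIC=3, i.e. SVE is implemented and a GIC CPU interface is present.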
I adapted your module to print other registers. I could only access ID_AA64PFR1_EL1, because neither PFR2 nor ZFR0 (both referenced in the PFR0 doc) are known here, but if you’re interested in other registers, feel free to update your code to dump them:
[26399.497569] [2025:02:05 04:13:11][pid:13170,cpu0,insmod]ID_AA64PFR0_EL1: 0x1201111123111111
[26399.497581] [pid:13170,cpu0,insmod]ID_AA64PFR1_EL1: 0x10321
[26399.499525] [pid:13169,cpu0,rmmod]Module unloaded.
[26401.649931] [2025:02:05 04:13:14][pid:13175,cpu1,insmod]ID_AA64PFR0_EL1: 0x1201111123111111
[26401.649955] [pid:13175,cpu1,insmod]ID_AA64PFR1_EL1: 0x10321
[26401.656614] [pid:13174,cpu1,rmmod]Module unloaded.
I simplified your module like this:
#include <linux/module.h>
#include <asm/sysreg.h>

#define DUMP_REG(reg) printk(KERN_INFO #reg ": 0x%llx\n", read_sysreg(reg))

static int __init read_sysreg_init(void)
{
	DUMP_REG(ID_AA64PFR0_EL1);
	DUMP_REG(ID_AA64PFR1_EL1);
	return 0;
}

module_init(read_sysreg_init);
MODULE_LICENSE("GPL");