NUMA emulation on RK3588

Well, it does not improve performance the way it does on the Raspberry Pi.

I have rebased the patch onto the Rockchip kernel here. With numa=fake=4 it does not improve the Geekbench score. It even reduces it somewhat. The PCI bus even complains about it.

NUMA: No NUMA configuration found
Dec 26 15:31:11 alarm kernel: Faking a node at [mem 0x0000000000200000-0x00000000c017ffff]
Dec 26 15:31:11 alarm kernel: Faking a node at [mem 0x00000000c0180000-0x00000001800fffff]
Dec 26 15:31:11 alarm kernel: Faking a node at [mem 0x0000000180100000-0x000000024007ffff]
Dec 26 15:31:11 alarm kernel: Faking a node at [mem 0x0000000240080000-0x00000002ffffffff]
Dec 26 15:31:11 alarm kernel: NUMA: NODE_DATA [mem 0xc017da40-0xc017ffff]
Dec 26 15:31:11 alarm kernel: NUMA: NODE_DATA [mem 0x1800fda40-0x1800fffff]
Dec 26 15:31:11 alarm kernel: NUMA: NODE_DATA [mem 0x1ffffda40-0x1ffffffff]
Dec 26 15:31:11 alarm kernel: NUMA: NODE_DATA [mem 0x2feea4a40-0x2feea6fff]
Dec 26 15:31:11 alarm kernel: Zone ranges:
Dec 26 15:31:11 alarm kernel:   DMA      [mem 0x0000000000200000-0x00000000ffffffff]
Dec 26 15:31:11 alarm kernel:   DMA32    empty
Dec 26 15:31:11 alarm kernel:   Normal   [mem 0x0000000100000000-0x00000002ffffffff]
Dec 26 15:31:11 alarm kernel: Movable zone start for each node
Dec 26 15:31:11 alarm kernel: Early memory node ranges
Dec 26 15:31:11 alarm kernel:   node   0: [mem 0x0000000000200000-0x00000000c017ffff]
Dec 26 15:31:11 alarm kernel:   node   1: [mem 0x00000000c0180000-0x00000000efffffff]
Dec 26 15:31:11 alarm kernel:   node   1: [mem 0x0000000100000000-0x00000001800fffff]
Dec 26 15:31:11 alarm kernel:   node   2: [mem 0x0000000180100000-0x00000001ffffffff]
Dec 26 15:31:11 alarm kernel:   node   3: [mem 0x00000002f0000000-0x00000002ffffffff]
Dec 26 15:31:11 alarm kernel: Initmem setup node 0 [mem 0x0000000000200000-0x00000000c017ffff]
Dec 26 15:31:11 alarm kernel: Initmem setup node 1 [mem 0x00000000c0180000-0x00000001800fffff]
Dec 26 15:31:11 alarm kernel: Initmem setup node 2 [mem 0x0000000180100000-0x00000001ffffffff]
Dec 26 15:31:11 alarm kernel: Initmem setup node 3 [mem 0x00000002f0000000-0x00000002ffffffff]
....
[    3.901772] pci_bus 0002:20: Unknown NUMA node; performance will be reduced
[    3.909246] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[    4.225157] pci_bus 0004:40: Unknown NUMA node; performance will be reduced
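
For anyone who wants to see how much RAM each fake node actually ended up with, here is a minimal sketch that sums the "Early memory node ranges" lines from the log above. The sample text is copied from that log; the function name and regex are just illustrative, not part of any kernel tooling.

```python
import re
from collections import defaultdict

# "Early memory node ranges" lines copied from the boot log above.
EARLY_RANGES = """\
  node   0: [mem 0x0000000000200000-0x00000000c017ffff]
  node   1: [mem 0x00000000c0180000-0x00000000efffffff]
  node   1: [mem 0x0000000100000000-0x00000001800fffff]
  node   2: [mem 0x0000000180100000-0x00000001ffffffff]
  node   3: [mem 0x00000002f0000000-0x00000002ffffffff]
"""

# Matches "node N: [mem start-end]"; the colon keeps it from matching
# the "Initmem setup node N [mem ...]" lines.
NODE_RE = re.compile(r"node\s+(\d+): \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]")


def populated_bytes_per_node(log_text):
    """Sum the inclusive [start, end] ranges reported for each node."""
    totals = defaultdict(int)
    for node, start, end in NODE_RE.findall(log_text):
        totals[int(node)] += int(end, 16) - int(start, 16) + 1
    return dict(totals)


if __name__ == "__main__":
    for node, size in sorted(populated_bytes_per_node(EARLY_RANGES).items()):
        print(f"node {node}: {size / 2**20:.1f} MiB")
```

Going by these ranges, the fake nodes are split evenly by address span rather than by populated memory, so node 3 is left with only 256 MiB while node 0 gets roughly 3 GiB. Whether that imbalance has anything to do with the unchanged score is not something this log can tell.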

Posting this in case anyone is wondering about the results or has a suggestion for something else to check.

UPDATE: I noticed that I hadn’t run the benchmark under numactl, which invalidates all results. Let me retest.

UPDATE 2: After running with numactl --interleave=all ./src/Geekbench-6.3.0-LinuxARMPreview/geekbench_aarch64 the score still did not change. So I am bailing out for now.

1 Like

Why should it?

The sole reason for this NUMA emulation thingy done with RPi 5 is to hide BCM2712’s terrible memory performance by improving the combined score of a single synthetic benchmark that does not represent anything happening in the real world.

BCM2712 consists of four A76 cores, yet before this NUMA patch thingy Geekbench 6 shows multi-core scores just twice as high as single-core scores, which is a clear indication that there’s something wrong: both with the SoC and the benchmark in question.

If we disable the A55 cores on RK3588 designs we again get a quad-core A76 CPU where the GB6 multi-core score is at least three times higher than the single-core score of this specific benchmark:

https://browser.geekbench.com/v6/cpu/compare/5377105?baseline=5424972

Now with this NUMA nonsense the multi-core score of BCM2712 on GB6 improves by a few percent. And this only applies to this specific combination of SoC (or more precisely the memory controller inside) and benchmark version. Choose any other benchmark where the multi-threaded execution is not as flawed as Geekbench 6 (Geekbench 5 will already do since here RPi 5 scores almost three times higher multi vs. single) and this NUMA emulation hack doesn’t change anything. Same applies to replacing BCM2712 with any other SoC not as broken wrt memory performance.

Geekbench 6 (at least on any platform other than x64) is garbage especially wrt multi-core scores: see here @geerlingguy testing a 192-core Ampere Altra setup:

Single vs. multi:

  • Geekbench 6: 1309 vs. 15160 (192 cores score 11.6 times faster than a single core)
  • Geekbench 5: 958 vs. 80639 (192 cores score 84 times faster than a single core)
  • 7-zip MIPS (v16.02): 4783 vs. 745720 (192 cores score 156 times faster than a single core)
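
The ratios in the list above can be reproduced with a small sketch. The scores are the ones quoted in the list; the per-core efficiency figure (speedup divided by core count) is an added illustration, not something from the original test.

```python
CORES = 192  # core count of the Ampere Altra setup quoted above

# (single-core score, multi-core score) as quoted in the list above.
RESULTS = {
    "Geekbench 6": (1309, 15160),
    "Geekbench 5": (958, 80639),
    "7-zip MIPS (v16.02)": (4783, 745720),
}


def scaling(single, multi, cores=CORES):
    """Return (multi/single speedup, speedup as a fraction of the core count)."""
    speedup = multi / single
    return speedup, speedup / cores


if __name__ == "__main__":
    for name, (single, multi) in RESULTS.items():
        speedup, efficiency = scaling(single, multi)
        print(f"{name}: {speedup:.1f}x speedup, {efficiency:.0%} of ideal scaling")
```

Seen this way, the gap is less about absolute scores and more about how far each benchmark's multi-threaded result falls from linear scaling on the same machine.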

I really don’t get why people (still) rely on this Geekbench garbage… and on other platforms like RISC-V it’s even worse than the pathetic situation we observe on ARM…

EDIT: @geerlingguy corrected me wrt benchmark situation on RPi 5B. At least HPL and Ollama do also benefit from this NUMA emulation stuff in the range of 4%-5%.

3 Likes

Absolutely!

Also it’s important to keep in mind that the RPi5 still has a prehistoric single-channel 32-bit memory bus. 32 bits!!! For 4 cores, when most workloads nowadays are memory-bound! Other SoCs featuring reasonably modern cores have at least 64 bits, sometimes even more, and split into multiple channels (the Altra I have at work has six 64-bit channels, the LX2160A has two 64-bit ones). RK3588 splits its 64 bits into 4x16-bit channels, which means that the 4 big cores enjoy relative independence and low latency when fetching from or writing to multiple areas in parallel. On an RPi5 you need to wait an eternity for the operation in progress to complete: that’s a minimum of 16 memory bus cycles for a single cache line to finish before the 3 other cores can fight again over who will be granted access to the RAM.
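
The "16 memory bus cycles" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 64-byte cache line (the line size on Cortex-A76) and counts raw data transfers only, ignoring DDR command overhead, refresh, and burst details, so treat the numbers as a lower bound rather than measured latency.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache line, as on Cortex-A76


def transfers_per_cache_line(bus_width_bits, line_bytes=CACHE_LINE_BYTES):
    """Number of bus-wide data transfers needed to move one cache line."""
    bytes_per_transfer = bus_width_bits // 8
    return line_bytes // bytes_per_transfer


if __name__ == "__main__":
    # Bus widths taken from the post above; the labels are descriptive only.
    for label, width in [
        ("RPi5, single 32-bit channel", 32),
        ("RK3588, one of four 16-bit channels", 16),
        ("one 64-bit channel", 64),
    ]:
        print(f"{label}: {transfers_per_cache_line(width)} transfers per line")
```

A single 16-bit channel needs more transfers per line than the RPi5's 32-bit bus, but with four independent channels up to four cache lines can be in flight at once, which is the independence described above.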

While I was attracted by the correct cores in the RPi5 (at last!), I never bought one due to its ridiculous memory controller that makes it unfit for most tasks, as is even shown in benchmarks like above, and in the patches that try hard to work around that misdesign.

3 Likes

Well, at least a mystery is solved…

1 Like

I don’t think BCM2712 is a ‘misdesign’, it’s just that Set-Top boxes don’t need performant memory access.

Contrary to what the RPi guys are telling, the BCM2712, just like any other SoC found on Raspberries before, is not an ‘application processor’ designed for or by them but simply a silicon variant of a VideoCore setup designed for something entirely different. In BCM2712’s case, Broadcom’s Muskoka STB platform:

1 Like

Agreed Thomas, I was not clear, I meant that the “misdesign” here is the use of such an unfit SoC in a generic board designed to run applications. That said, one can still question the purpose of putting 4 big cores in an SoC with so little bandwidth. But there are probably valid applications of this device, though I think they would already be fine with only two cores…