I think I found another reason for cl-mem always reporting strange values. In addition to reading from the cache, it triggers the driver’s watchdog after about 2.6 billion cycles:
[ 2011.117283] [2025:03:30 08:30:27][pid:0,cpu0,in irq]mali 15000000.gpu: [732656769357] Iterator PROGRESS_TIMER timeout notification received for group 0 of ctx 3371_51 on slot 0
[ 2011.133566] [pid:2677,cpu1,kworker/u24:4]mali 15000000.gpu: Notify the event notification thread, forward progress timeout (2621440000 cycles)
And indeed, each test takes roughly 5s. I could shorten them by reducing the number of repetitions (REPS in config.h), but then the performance looks abysmal, probably because the measurement also counts the time needed to load the kernel into the GPU, and maybe even to initialize the DRAM. I modified the program to reduce the number of loops, unroll them, and avoid mixing threads in the same cache lines, but the results remain extremely variable and the values stay low.
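For illustration, here is the general shape of that change as a small OpenCL kernel (a sketch of the idea only, not the actual cl-mem code, and all names are made up): each work-item walks over cache lines it owns exclusively, and the four float4 reads covering a 64-byte line are unrolled by hand.

```c
/* Sketch only, not the actual cl-mem patch: each work-item reads whole
 * 64-byte cache lines that belong to it alone, so two work-items never
 * share a line, and the four float4 loads covering one line are
 * unrolled explicitly. */
__kernel void read_own_lines(__global const float4 *src,
                             __global float *dst,
                             uint lines_per_item)
{
    uint gid  = get_global_id(0);
    /* index of the first float4 of this work-item's block (4 x float4 = 64 B) */
    uint base = gid * lines_per_item * 4;
    float4 acc = (float4)(0.0f);

    for (uint l = 0; l < lines_per_item; l++) {
        uint o = base + l * 4;
        acc += src[o] + src[o + 1] + src[o + 2] + src[o + 3];
    }

    /* store the sum once so the reads cannot be optimised away */
    dst[gid] = acc.x + acc.y + acc.z + acc.w;
}
```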
Instead I found another program that gives me reproducible results and more metrics: clpeak. Its values look perfectly correct. This time it really seems to load the RAM, because the CPU numbers consistently drop during the tests instead of increasing. It performs the measurements in several ways, grouping reads into float vectors of various widths.
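To give an idea of what that kind of measurement looks like, here is a minimal sketch of such a grouped-read kernel (again illustrative only, not clpeak’s actual code): consecutive work-items read consecutive float4 vectors so the loads are coalesced, and each one accumulates what it reads so the compiler cannot drop the loads.

```c
/* Sketch only, not clpeak's actual kernel: each work-item reads
 * vecs_per_item float4 vectors (16 bytes per load); consecutive
 * work-items read consecutive vectors, then jump ahead by the global
 * size for the next iteration. */
__kernel void read_bw_float4(__global const float4 *src,
                             __global float *dst,
                             uint vecs_per_item)
{
    uint gid = get_global_id(0);
    uint gsz = get_global_size(0);
    float4 acc = (float4)(0.0f);

    for (uint i = 0; i < vecs_per_item; i++)
        acc += src[gid + i * gsz];

    /* one store per work-item keeps the reads alive */
    dst[gid] = acc.x + acc.y + acc.z + acc.w;
}
```

Here are the new measurements with that tool and rambw running in parallel: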
CFG (MHz) | DDR (MT/s) | Th.BW (GB/s) | CPU only (GB/s) | GPU only (GB/s) | CPU+GPU (GB/s [%th])
----------|------------|--------------|-----------------|-----------------|------------------------
800       | 1600       | 25.6         | 22.4            | 12.48           | 8.7+5.87=14.57 (56.9%)
1600      | 3200       | 51.2         | 39.8            | 25.34           | 22.5+8.63=31.13 (60.8%)
2133      | 4266       | 68.3         | 40.3            | 31.6            | 27.6+12.58=40.18 (58.8%)
2400      | 4800       | 76.8         | 40.3            | 35.44           | 30.3+14.31=44.61 (58.1%)
2750      | 5500       | 88.0         | 40.3            | 39.5            | 34.0+15.5=49.5 (56.25%)
3000      | 6000       | 96.0         | 40.3            | 42.56           | 35.9+16.9=52.8 (55.0%)
3200      | 6400       | 102.4        | 40.3            | 41.62           | 34.9+17.9=52.8 (51.6%)
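For reference, the Th.BW column is simply the transfer rate multiplied by the bus width (the numbers match a 128-bit, i.e. 16-byte-wide, interface), and [%th] is the combined measurement divided by it. Taking the 6000 MT/s row as an example:

```
Th.BW   = 6000 MT/s * 16 B = 96.0 GB/s
CPU+GPU = 35.9 + 16.9      = 52.8 GB/s
[%th]   = 52.8 / 96.0      = 55.0%
```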
So what we see here is that the total bandwidth stays in the usual range of around 60% efficiency, and that while it keeps increasing past 4266 MT/s, it reaches a plateau around 6000 MT/s: beyond that point the GPU gets a bit faster at the expense of the CPU, which slows down, and the overall efficiency starts to drop.
So this indicates that there is definitely something limiting the number of transfers per second that can be achieved with the RAM, though the total performance is quite good and close to what can be expected from such a chip. I didn’t manage to change the ratio between CPU and GPU by playing with the DRAM settings mentioned above (per-port priority, etc.).
I forgot to say a word about power draw:
- my board idles at 12.5W right now (4x3.1 + 4x2.6 + 4x1.8 GHz, DDR5-6400)
- during the CPU test, it peaked at 25.0W
- during the GPU test, it peaked at 21.8W
- during the CPU+GPU test, it reached 28.7W
- running it on all 12 CPU cores + GPU, it reached 30.8W
Also, a few numbers differ slightly from the previous test because I discovered a service, packagekit, that wastes CPU for a few minutes at boot and adds noise, so this time I killed it (the other things I kill are gdm and cix-audio-switch).
Edit again:
I have now found what limits the DRAM speed for the CPU above certain levels: it’s the CI700 interconnect frequency, which is set to 1.5 GHz in opp_config_custom.h. Lowering it to 1 GHz makes the CPU<->DRAM bandwidth drop to 27.3 GB/s, while raising it to 1.8 GHz brings it to 47.5 GB/s (vs 40.3), i.e. 46.4% of the DRAM bandwidth instead of 39.4%; in all three cases that works out to roughly 27 bytes per CI700 cycle, so the CPU-side bandwidth scales almost linearly with the interconnect clock. However, it crashed during some builds. The point is not to overclock it (Arm indicates it can reach 2 GHz on 5nm, and this chip is on 6nm), but to understand what affects the system’s limits.
Interestingly, the GPU is not subject to the CI700 frequency, so the GPU bandwidth remains exactly the same at the various settings (higher and lower), and the combined bandwidth changes very little.