DRAM speed capped to 4266 MT/s on O6?

CIX has responded that due to a microarchitecture limitation, the CPU alone cannot fully saturate the bandwidth of the memory bus. Full bandwidth can be achieved when the coprocessors are also being stressed. They have provided some instructions to test this with the CPU and GPU.

We will verify their solution first before providing instructions.

2 Likes

Thank you for the feedback! That’s quite sad because if so, it means it’s pointless to use LPDDR5 on this device, as LPDDR4X already reaches saturation, and your latest change to switch from 2750 to 3000 has absolutely no effect :frowning:

2 Likes

I’d imagine that anything directly connected to the memory bus that supports DMA can benefit from the additional bandwidth, for example an external GPU.

1 Like

Maybe. Regardless, it’s a bit sad to lose 33% of the expected bandwidth when it could be used efficiently. That said, once they release docs and full source, maybe others will have ideas about things to try to improve the situation, because such architectural limitations are often mostly a tradeoff between different use cases.

1 Like

Just thinking, I’m still having doubts about this. The reason is that when there are limitations somewhere, improvements in other areas still help a little bit (by reducing the cache line fetch time etc), and the curve shouldn’t look like a straight line followed by an immediate plateau. It should normally show a curve around the inflection point, where small improvements still allow getting closer to the internal limit without ever reaching it. Thus I’m suspecting that CIX is just focusing on some known limitations and not necessarily on what is observed. It’s normal for a CPU to never be able to use the full DRAM bandwidth, but that’s not the point here: we’re seeing only 40% of the theoretical one.

Also, if the limit is due to “internal architectural limitations”, it must still be tied to the frequency of something, be it the core frequency (but here different cores have different frequencies), the DMC frequency, the DDR speed, maybe an internal bus speed, etc. Where is that limitation supposed to happen, given that neither the CPU cores nor the DDR frequency are involved? I’m more than willing to have an open (or private, if they prefer) discussion with CIX engineers, including about test methods, but I still have the feeling that there is some confusion here and that the limit is much lower than what the “internal limitations” should permit.

3 Likes

Here is the guide to stress both CPU and GPU memory.

Some reference value from CIX at 5500MT/s:

Unit: GB/s   CPU Read   CPU+GPU Read   CPU Write   CPU+GPU Write
Bandwidth    39.6       33.7+42.9      14          12.1+42.8

Interesting to see that the write bandwidth is halved, like on some AMD Ryzen processors.

1 Like

Thank you Yuntian for this! It’s indeed surprising to see the lower write bandwidth.

In my case, the write bandwidth from the CPU measures exactly the same as the read bandwidth, i.e. 40 GB/s. Did you test the CPU using a single core or multiple cores? FWIW I’m getting 21.4 GB/s in each direction on the big cores and 11.3 GB/s on the little ones (single-core test) with rambw.

Also did you retest at 6000 MT/s to confirm you’re seeing a difference ?

CPU write is tested with bw_mem -P 10, which I assume means 10 parallel jobs. It could also be a big/little core effect like you said, with the scheduler messing with me.

I did not. I only confirmed that the combined bandwidth is higher than in the pure CPU test, which means there is some bandwidth that is unavailable to the CPU, and the original pure CPU test does not fully saturate the memory controller.

1 Like

Yes, maybe the tool gives the same total job to all threads and waits for them to complete, or something like this. You can test with taskset -c 0,5-11 to use only the big+medium cores, thus logically -P 8. I’ll try your tests myself, maybe one evening this week, otherwise next week-end. I’ll compare the 2750 and 3000 settings, and I can also try with the Merak image that offers the choice in the BIOS. I don’t know if we can copy that section to the O6 subdir, as was proposed for the CPU frequency, and automatically benefit from that menu entry; that could be simpler to manipulate if it works.

Yeah, it’s quite useful to know that already, even if it’s unpleasant to see that in the end it’s only 40% of the available bandwidth, i.e. having two channels in fact only benefits the GPU. I hope that once the SoC doc becomes available we’ll find ways to overcome this, at least in part.

It was disabled in our regular build specifically because it was hardcoded with EVB hardware info (like power-controlling GPIOs); only verified features were provided under Device Manager.

You can try enabling it (probably the TOKEN SETUP mentioned above), and the code should handle this gracefully. But we cannot guarantee everything works well.

OK, thank you. I might eventually have a look. I have exactly zero knowledge of EDK2, and I must say that the whole project looks very much “windowsy” to me :slight_smile: Thus it takes time to discover the magic in it everywhere (stuff that is automatically selected to be built, files that serve at some point to do obscure things, etc). But I understand your point regarding the EVB and GPIOs, that totally makes sense, of course!

Open your build_and_package.sh script, then append -D TOKEN_SETUP_SUPPORT=TRUE to the EDK2 build command.

The copy/paste dance for mem_config and pm_config is just for that bash script to pick them up. The script only builds those config blobs if their sources are present in the platform directory, thus overriding the default prebuilt ones.

1 Like

Thanks, will try. To be honest I have no idea what this does in the build system, but since you already found other parts, I guess you’re way more at ease than me in this ocean of components :wink:

Hi @RadxaYuntian,

I’ve run your utility on my board. It’s interesting: it shows the GPU has top priority over the CPU, because whether the CPU is under load or idle, the GPU’s performance doesn’t change. I’m seeing:

  • 47.3 GB/s GPU read
  • 48.8 GB/s GPU write
  • 48.8 GB/s GPU copy

The impact on CPU reads is the following:

  • no GPU access: 40.9 GB/s CPU read
  • GPU reading: 40.1 GB/s CPU read
  • GPU writing: 31.0 GB/s CPU read
  • GPU copying: 32.8 GB/s CPU read

For CPU writes it’s different. First, I said earlier that I was surprised to measure the exact same perf as reads, but I was wrong: I found that I need to use the generic code only, as the other one still does reads… I’m measuring 35.7 GB/s from the 4 little cores, 18.0 GB/s from the big cores, and 17.7 GB/s from the medium cores. Quite strange. Big+medium combined achieve 20.8 GB/s. Thus I’m running the perf test on the little cores, which are twice as fast here.

The impact on little CPU writes is the following:

  • no GPU access: 35.7 GB/s CPU write
  • GPU reading: 27.5 GB/s CPU write
  • GPU writing: 27.9 GB/s CPU write
  • GPU copying: 27.7 GB/s CPU write

The impact on big CPU writes is the following:

  • no GPU access: 18.0 GB/s CPU write
  • GPU reading: 16.8 GB/s CPU write
  • GPU writing: 17.2 GB/s CPU write
  • GPU copying: 17.1 GB/s CPU write

And again, numbers do not change for the GPU test.

I remember noticing similar things on the RK3399, where big cores have much slower writes than little cores, and I seem to remember it was caused by priority queues in the DMC or L3 cache; I don’t remember the details to be honest. Maybe we’ll find ways to improve this in the future. Or some might start to develop DMA accelerators as software in the GPU to assist the CPU :-/

Note that my DRAM is set to 6000 MT/s, big CPU to 3.0 GHz and med CPU to 2.4 GHz in this test. I’ll rerun more tests once I manage to make these parts configurable in the BIOS.

2 Likes

So I tried this without changing the mem_config part, and saw the menu appear but it had no effect at all. I’ll retry with the mem_config from CIX to compare.

In the menu, I noticed config items about port 0 & 1 priorities. I’m wondering if that could be related to CPU vs GPU access to the DRAM. In my tests it didn’t change anything, due to the overall setup code being ineffective, but I’ll pursue my tests on this. I noticed that in the O6 code they’re set to QoS mode and 100+100, while Merak sets them to 30+70 and shows similar performance, so I don’t expect much change from this, but I preferred to share the info in case it sparks a light for someone.

OK I’ve run several tests with the regular BIOS and setting MEM_CFG_MEMFREQ at build time.

The first observation is that the cl-mem program reports totally incorrect values: for all DRAM frequencies, it reports 48.8 GB/s. At first glance, the input.cl program contains a loop which uses the thread number in the lowest address bits, so when you have 16 threads each reading 32 bits in parallel from the same cache line, in reality the cache line is read once and the 15 other threads fetch the contents from the cache. I’ll try to fix it to make sure that each thread reads a separate cache line, or possibly even implement the OpenCL part in ramspeed, in the hope of making it much easier to use both CPU and GPU at the same time.

The second observation is that at certain frequencies, running cl-mem in parallel increases the CPU bandwidth. I suspect that by keeping the DRAM controller busy, it allows it to perform back-to-back accesses and prevents it from going idle.

In addition, it looks like there’s indeed an increase in CPU bandwidth even past 5500 MT/s when cl-mem is running in parallel, which would indicate that the DRAM frequency setting is effective. I’ve run the measurements on the big cluster, running at 3.1 GHz.

Here are some measured values. Columns indicate:

  • the MEM_CFG_MEMFREQ value (in MHz)
  • the equivalent DDR speed
  • the corresponding theoretical DRAM bandwidth in GB/s (upper limit)
  • the CPU-only read speed in GB/s + percent of theoretical limit
  • the GPU-only read speed in GB/s + percent of theoretical limit
  • the CPU read speed during GPU test, in GB/s + percent of theoretical limit
  • the sum of the CPU+GPU combined test + percent of theoretical limit

I’ve highlighted the abnormal values (too high percentages due to cl-mem reporting garbage) and the surprising ones (CPU being faster during GPU test than without GPU):

 CFG    DDR    Th.BW   CPU only      GPU only      CPU during GPU  CPU+GPU
(MHz)  (MT/s)  (GB/s)  (GB/s [%th])  (GB/s [%th])  (GB/s [%th])    (GB/s [%th])
  800   1600    25.6   19.0 (74.2%)  48.8 (190%)    9.6 (37.5%)    58.4 (228%)
 1600   3200    51.2   37.4 (73.0%)  48.8 (95.3%)  23.3 (45.5%)    72.1 (141%)
 2133   4266    68.3   39.7 (58.2%)  48.8 (71.5%)  29.1 (42.6%)    77.9 (114%)
 2400   4800    76.8   39.9 (51.9%)  48.8 (63.5%)  31.6 (41.1%)    80.4 (105%)
 2750   5500    88.0   40.1 (45.5%)  48.8 (55.4%)  35.9 (40.8%)    84.7 (96.2%)
 3000   6000    96.0   40.3 (42.0%)  48.8 (50.8%)  41.6 (43.3%)    90.4 (94.2%)
 3200   6400   102.4   40.3 (39.4%)  48.8 (47.7%)  44.7 (43.7%)    89.5 (87.4%)

Does anybody from CIX / Radxa know what port 0 / port 1 correspond to here in the priority settings? I’ll run some tests to see if I can figure out a way to make the CPU run faster, but getting more info about what the settings correspond to would be helpful. Of particular interest to me is edk2-platforms/Platform/Radxa/Orion/O6/mem_config/default/MemFeatures.c:

  // Performance feature
    .DataMask         = 1,
    .RfmEn            = 0,
    .WckAlwaysOn      = 1,
    .AutoPrechargeEn  = 1,
    .PbrEn            = 1,
    .SelBankInQ       = 1,
    .PortPriority     = PORT_PRIORITY_QOS,  // also supports PORT_PRIORITY_FIXED
    .BdwP0            = 100,  // used with QOS; port 0 percent from 1 to 100
    .BdwP1            = 100,  // Port 1 percent
    .BdwOvflowP0      = 1,
    .BdwOvflowP1      = 1,
    .RPriorityP0      = 4,  // used with fixed priorities
    .WPriorityP0      = 4,
    .RPriorityP1      = 4,
    .WPriorityP1      = 4,
    .AddressMode      = 0,
    .WrLEcc           = 0,
    .RdLEcc           = 0,
    .IEcc             = 0,
1 Like

I think I found another reason for cl-mem always reporting strange values: in addition to reading from the cache, it triggers the driver’s watchdog after 2.5B cycles:

[ 2011.117283] [2025:03:30 08:30:27][pid:0,cpu0,in irq]mali 15000000.gpu: [732656769357] Iterator PROGRESS_TIMER timeout notification received for group 0 of ctx 3371_51 on slot 0
[ 2011.133566] [pid:2677,cpu1,kworker/u24:4]mali 15000000.gpu: Notify the event notification thread, forward progress timeout (2621440000 cycles)

And indeed, each test takes roughly 5s. I could shorten them by reducing the number of repetitions (REPS in config.h), but then the performance looks abysmal, probably because it also counts the time it takes to load the kernel into the GPU, and maybe even initialize the DRAM. I modified the program to reduce the number of loops, unroll them, and avoid mixing threads in the same cache lines, but the results remain extremely variable and the values are low.

I found instead another program which gives me reproducible results and more metrics: clpeak. And the values look perfectly correct. This time it seems to really load the RAM, because the CPU numbers always drop and no longer increase during the tests. It performs the measurements using various methods, grouping reads on floats. Here are the new measures with that tool and rambw in parallel:

 CFG    DDR    Th.BW   CPU only  GPU only  CPU+GPU
(MHz)  (MT/s)  (GB/s)  (GB/s)    (GB/s)    (GB/s [%th])
  800   1600    25.6   22.4      12.48      8.7+5.87=14.57 (56.9%)
 1600   3200    51.2   39.8      25.34     22.5+8.63=31.13 (60.8%)
 2133   4266    68.3   40.3      31.6      27.6+12.58=40.18 (58.8%)
 2400   4800    76.8   40.3      35.44     30.3+14.31=44.61 (58.1%)
 2750   5500    88.0   40.3      39.5      34.0+15.5=49.5 (56.25%)
 3000   6000    96.0   40.3      42.56     35.9+16.9=52.8 (55.0%)
 3200   6400   102.4   40.3      41.62     34.9+17.9=52.8 (51.6%)

So what we see here is that the total bandwidth remains within the usual range of around 60% efficiency, and that while it continues to increase past 4266 MT/s, it reaches a plateau around 6000 MT/s, where the GPU gets a bit faster at the expense of the CPU, which gets slower, and the overall efficiency starts to drop.

So this indicates that there’s definitely something limiting the number of xfers per second that can be achieved with the RAM, though the total performance is quite good and close to expectations for such a chip. I didn’t manage to change the ratios between CPU and GPU by playing with the DRAM settings above (priority per port etc).

I forgot to say a word about power draw:

  • my board idles at 12.5W right now (4x3.1 + 4x2.6 + 4x1.8 GHz, DDR5-6400)
  • during the CPU test, it peaked at 25.0W
  • during the GPU test, it peaked at 21.8W
  • during the CPU+GPU test, it reached 28.7W
  • running it on all 12 CPU cores + GPU, it reached 30.8W

Also, a few numbers slightly differ from the previous test because I discovered a service, packagekit, that wastes CPU at boot for a few minutes and adds noise, so this time I killed it. (The other things I kill are gdm and cix-audio-switch.)

Edit again:
I now found what is limiting the DRAM speed for the CPU above certain levels: the CI700 interconnect frequency. It is set to 1.5 GHz in opp_config_custom.h. Lowering it to 1 GHz makes the CPU<->DRAM bandwidth drop to 27.3 GB/s. Increasing it to 1.8 GHz makes the bandwidth reach 47.5 GB/s (vs 40.3), i.e. 46.4% of the DRAM bandwidth instead of 39.4%. However, it crashed during some builds. But the point is not to overclock it (Arm indicates it can reach 2 GHz at 5nm; this chip is in 6nm), but to understand what has an effect on the system’s limits.

Interestingly, the GPU isn’t subject to the CI700 frequency, so the GPU bandwidth remains exactly the same at various levels (higher and lower). And the combined bandwidth changes very little.

6 Likes

For those interested in reproducing the experiment, don’t hold your breath too much. It appears that this board has a particularly good chip. My new one (32GB, arrived yesterday) has less margin: it doesn’t boot with the DSU at 1.5 GHz or above (vs 1.6 GHz for this one), and sees occasional crashes with the first big cluster at 3.0 GHz.

However, 2.8 GHz for the big cluster, 2.6 GHz for the medium one, 3200 MHz for the DRAM and 1.6 GHz for the CI700 is quite OK and mostly within specs, and delivers 43 GB/s for the CPU, 49 GB/s for the GPU, and 41+15.3=56.3 GB/s for CPU+GPU (55% efficiency). So I guess we won’t have much more to grab here.

1 Like

This is nothing unusual:
the Pi5 can overclock from 2.8 up to 3.2 depending on the particular unit. You may get 5 of them off the shop shelf and only some of them will be stable at the higher clock speed.
Who is surprised at all? This is reality. Thanks @willy for the fair report!

I mean, I’m not surprised either; my point was essentially to give more feedback, since until now there were only two reports of OC, both very successful, and I prefer not to make people dream, buy it with that goal in mind, and then be disappointed. I think that most of the performance gain will come from upping the med cores anyway, since they’re pretty low in the default BIOS, so there’s no point pushing the machine to its limits; it’s already pretty good like that. Even the DDR doesn’t bring anything past 5500 already. That’s a bit sad, but that’s how it is. I’m keeping my machine running at 6400 in order to maximize the chance of detecting a glitch, but I’ll lower it later.

1 Like