DRAM speed on ROCK 5 ITX

Hello,

I have been wondering whether DRAM speed could explain why the ROCK 5 ITX compiles slightly slower than the ROCK 5B. Compile times are interesting because they’re extremely sensitive to DRAM speed (well, a bit less on this CPU, which has 3 MB of shared L3 cache, but still).

My tests are always the same: I compile the exact same code (haproxy-3.0.0) with a Canadian gcc-4.7 producing code for i586, and the same binary compiler is used on both platforms. The difference is not huge but noticeable. My board boots with the DDR init code v1.16 at 2400 MHz. I wanted to test 2736 MHz but didn’t want to risk bricking the board by flashing something that wouldn’t boot. However, I found on the board’s schematic that it’s possible to force it to boot from SD by connecting the maskrom button’s output to ground through a 20k resistor. So I did this and booted Joshua Riek’s Ubuntu image version 2.1.0, which still employs v1.11 at 2736 MHz, while 2.2.0 adopted v1.16. It worked fine, and in order to exclude any risk of kernel/userland differences (though I know from experience that this test really does not depend on these), I booted the same image and only used the SD to load u-boot.

The results are the following. I’ve measured the time taken to build using both a 32- and a 64-bit compiler, since 32-bit ones are always faster than 64-bit ones, running on either all 8 cores or only the 4 big ones (see the sketch after the table for how the runs can be pinned). The values are the time it takes to compile the code (averaged over 3-4 runs due to a +/- 0.15s variation between tests), lower is better:

Test    ROCK 5B LPDDR4X-4224   ROCK 5 ITX LPDDR5-4800   ROCK 5 ITX LPDDR5-5472
4x32b           24.4                   25.6                     24.2
4x64b           25.4                   26.5                     25.2
8x32b           19.3                   19.8                     19.1
8x64b           20.9                   21.5                     20.9
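
For reference, a minimal sketch of how such runs can be pinned with taskset; the cross-compiler name and the haproxy make options below are placeholders, not the exact command line used for the numbers above:

   # 4 big cores only (CPUs 4-7 are the A76 cores on the RK3588)
   make clean
   time taskset -c 4-7 make -j4 TARGET=linux-glibc CC=i586-unknown-linux-gnu-gcc
   # all 8 cores
   make clean
   time make -j8 TARGET=linux-glibc CC=i586-unknown-linux-gnu-gcc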

Thus it started to become obvious that LPDDR5 is far from being on par with LPDDR4X, since it takes LPDDR5 at 2736 MHz to catch up with LPDDR4X at 2112 MHz. I compared the ramlat results on each setup and found something quite intriguing. Looking at the stable window sizes from 32M to 128M, we’re seeing this:

  • ROCK 5B - LPDDR4X-4224
       size:  1x32  2x32  1x64  2x64 1xPTR 2xPTR 4xPTR 8xPTR
     32768k: 132.4 133.8 132.0 133.1 135.7 129.3 125.9 125.1
     65536k: 138.8 139.3 138.7 139.2 138.6 131.8 128.8 131.3
    131072k: 142.6 142.5 142.6 142.4 143.1 136.1 133.7 136.1
  • ROCK 5 ITX - LPDDR5-4800
       size:  1x32  2x32  1x64  2x64 1xPTR 2xPTR 4xPTR 8xPTR
     32768k: 237.8 233.3 237.5 233.3 237.4 231.3 231.6 233.7
     65536k: 248.4 243.9 246.8 242.8 246.2 241.5 242.0 243.5
    131072k: 251.7 249.1 251.1 248.8 251.1 247.1 246.9 248.5
  • ROCK 5 ITX - LPDDR5-5472
       size:  1x32  2x32  1x64  2x64 1xPTR 2xPTR 4xPTR 8xPTR
     32768k: 140.8 140.0 140.8 139.8 140.8 137.4 138.8 132.4
     65536k: 145.5 143.5 145.3 143.2 145.3 142.4 142.8 143.0
    131072k: 147.1 145.5 146.6 145.4 146.8 144.2 144.6 148.4

So roughly speaking, the ROCK 5B has a latency of around 138 ns, while the ROCK 5 ITX at 2736 MHz (the initial goal) is around 145 ns, hence slightly slower, though there is no reason to expect it to be faster since latency is largely unrelated to the ~30% frequency increase. The ROCK 5 ITX at 2400 MHz (the new default), however, is around 245 ns, i.e. about 70% slower than when configured at 2736 MHz.

The measured bandwidth, on the other hand, only varies by ~7%, which might be why this remained unnoticed until now.

Thus I’m wondering a few things related to this:

  • in a few excessively laconic git commits for rkbin, I’ve only seen “improve stability” to justify the rollback from 2736 to 2400, without any mention of a reported issue or any precise concern. There’s not even a mention of attempts to run at other intermediate speeds (e.g. 2666, which might possibly match some existing effective operating points). How were the issues observed? Were other frequencies tested?
  • Is the issue related to the signal routing on the ROCK 5 ITX board, to the DRAM chips, or to the SoC itself? The first two are the only ones that would leave hope that a future board revision could bring the higher frequency back.
  • the +70% latency increase at 2400 MHz compared to 2736 MHz looks totally abnormal and might result from a bad timing somewhere (an improperly coded RAS or CAS counter maybe?). It would be nice if someone involved in this change could have a look at it to figure out what’s really happening there.

For the record, I performed the measurement this way:

   git clone https://github.com/wtarreau/ramspeed
   cd ramspeed
   make -j
   taskset -c 7 ./ramlat -n -s 100 524288

Out of curiosity, what is a “Canadian gcc”?

Sorry, short name for “Canadian cross compiler”, which designates a compiler that you build on architecture A, to be executed on architecture B, in order to produce code for architecture C. It’s just an extension of a cross-compiler. It’s common in build farms, because the build process is usually painful and requires many dependencies, so you build your compilers on a fast machine with all dependencies met (typically x86), for execution on what will become your target machine(s), which can be embedded ones. These will then be used to produce the final code (typically called by distcc). Overall these compilers tend to be reasonably portable because they don’t depend on the build system and didn’t pick up anything from it by accident. Hoping this clarifies the situation a little bit :slight_smile:

A bit, but the mystery why it is “Canadian” remains.

I once read the explanation for this but I must confess I forgot it. The name remains in toolchain builders like crosstool-ng. I vaguely remember something like a pun around “Canadian cross”, but I’m not really sure about that anymore, and not being a native speaker doesn’t help me spot these :wink:


Slightly off topic, but are you Willy T, as in the founder of haproxy? I’ve been reading your posts here for a while and it has only just started to click.

Yes that’s me indeed. But that should not give me more credit for my posts :wink:


Haha, I’ll refrain then.

But seriously, haproxy is my favorite piece of software. I’ve deployed it in so many solutions for mainly performance and observability reasons. Thanks so much for sticking with it and having the documentation translated into English. :joy:

I’ve now added a reproducer to the ramspeed repository:

   git clone https://github.com/wtarreau/ramspeed
   cd ramspeed
   make -j
   time taskset -c 7 ./ramwalk 10

It takes 176s to execute on my Skylake, 417s on the rock5b and 745s on the rock5-itx, i.e. 79% slower than the rock5b.


@tkaiser figured that the dmc frequency could likely be causing this, and indeed, while testing I found that when the governor is set to dmc_ondemand (the default) or simple_ondemand, the frequency jumps to 2400 MHz during the memory-filling phase, then drops to 536 MHz during the memory-scanning phase. I forced the min frequency to 2400 MHz and reran the test. It now ran in 451 seconds, which is much closer to the rock5b (8% slower instead of 80% slower).
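
For those who want to reproduce this, here is a minimal sketch assuming the vendor kernel exposes the DMC as a devfreq device named dmc; the node name and the exact frequency values depend on the kernel and board:

   # list the DMC operating points and the current governor
   cat /sys/class/devfreq/dmc/available_frequencies
   cat /sys/class/devfreq/dmc/governor
   # pin the minimum frequency to 2400 MHz (pick one of the listed values, in Hz)
   echo 2400000000 > /sys/class/devfreq/dmc/min_freq
   # or simply force the performance governor
   echo performance > /sys/class/devfreq/dmc/governor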

Now I’m getting these ramlat timings:

  • ROCK 5 ITX - LPDDR5-4800
   size:  1x32  2x32  1x64  2x64 1xPTR 2xPTR 4xPTR 8xPTR
 32768k: 151.4 147.2 147.3 145.0 147.0 142.3 140.5 143.1 
 65536k: 151.5 149.7 151.0 149.5 151.1 147.3 147.0 149.9 
131072k: 153.2 151.0 152.8 150.8 152.8 150.2 149.6 152.0 

I don’t know why the DMC’s defaults work differently depending on the ddr_init code, but at least now we have an explanation for this significant perf regression between the two init codes. It’s also a bit strange that it remains at the lowest frequency during reads…

Normally modprobe governor_performance at boot should definitely work around this.
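
A minimal sketch of that workaround, assuming the devfreq performance governor was built as a module and that the DMC devfreq node is named dmc (both depend on the kernel configuration):

   # load the devfreq "performance" governor (only needed if built as a module)
   modprobe governor_performance
   # if loading it alone doesn't switch the DMC governor, select it explicitly
   echo performance > /sys/class/devfreq/dmc/governor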

I found something which might be of interest to those following this.

I finally managed to boot mainline 6.12-rc6 on the Rock 5 ITX (thanks to Heiko Stübner’s help comparing our setups, more on that later in another post). I found that the RAM was slow again on the board, but this time it was only slow for tasks running on the A76 cores (4-7) and not on the A55 ones (0-3). I suspected that, like on some RK3399, making the little cores work would unlock the performance, and that was exactly the case!

I ended up setting the small cores (cluster 0, i.e. cpufreq policy0) to performance and that simply solved the issue. There’s no directly accessible dmc driver on mainline, so I suspect that the controller’s frequency or governor is mapped one way or another to the first cluster’s governor. As long as that one stays at a low speed, the DRAM is slow, and it suffices to switch it to performance for the machine to work fine:

$ echo ondemand > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
$ taskset -c 4-7 ./rambw 
15566
$ echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
$ taskset -c 4-7 ./rambw 
24944

(The values are in MB/s, so the RAM works at 15.5 GB/s instead of 25 GB/s.)

In case of low performance, please double-check your governors (and/or run the test above from ramspeed).
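
For instance, something like this shows at a glance which governors are in use; the devfreq wildcard may return nothing on mainline kernels where the DMC is not exposed:

   # per-cluster CPU governors (on RK3588: policy0 = A55, policy4/policy6 = A76)
   grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
   # devfreq governors (vendor kernels expose the DMC here)
   grep . /sys/class/devfreq/*/governor 2>/dev/null
   # bandwidth from the big cores, using rambw from the ramspeed repo
   taskset -c 4-7 ./rambw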
