Nope, this is a long debunked myth originating from FreeNAS/TrueNAS forums: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
Quoting one of the ZFS designers, Matthew Ahrens:
Yeah, there’s nothing about ZFS that mandates ECC memory any more than anything else. I also didn’t say anything about using ZFS on Rock5 though it would probably work quite well, even if there wouldn’t be any support for the hardware decompression engine.
NVMe performance is not “really bad” either - I’ve been running my main buildbox off ZFS-on-NVMe for about a year and a half now, it’s pretty ridiculously fast. I went from a full OpenWrt image build taking well over an hour (on 4x250GB SATA SSD in RAID 0) to 25 minutes (those same 4x250GB SSDs as single-disk ZFS vdevs) to less than 10 minutes on a single-drive NVMe ZFS pool. And it’s not even a particularly good NVMe drive… Sure, most of the boost there comes from ARC, but still.
ANYWAY. This is off-topic. Point being, once I get my hands on a Rock5 I'll be sure to give the H10 a try.
Right now mdadm gives me 12 GB/s while ZFS manages merely 3-4 GB/s (7 NVMe drives). Not that it matters with the Rock5, but in terms of total NVMe throughput ZFS performance is terrible.
And yes, that’s going off-topic.
To whom it may concern… @lanefu @Tonymac32 ?
| Distro | Clockspeed | Kernel | 7-zip (MIPS) | AES-256 16 KB (kB/s) | memcpy (MB/s) | memset (MB/s) |
|---|---|---|---|---|---|---|
| Radxa Focal | 2350/1830 MHz | 5.10 | 16450 | 1337540 | 10830 | 29220 |
| Armbian Focal | 2350/1830 MHz | 5.10 | 14670 | 1339400 | 10430 | 29130 |
An 11% drop in 7-zip MIPS (a benchmark that is sensitive to memory latency).
The Armbian image was freshly built 2 days ago, the Radxa image not updated wrt bootloader/kernel. On the Armbian image, memory latency and bandwidth of the A55 cluster are completely trashed.
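For anyone who wants to double-check the 7-zip numbers independently of sbc-bench: 7-Zip's built-in benchmark reports a comparable MIPS rating and can be pinned per cluster (core numbering below is assumed to be the usual RK3588 layout, cpu0-3 = A55, cpu4-7 = A76):
# quick sketch: compare 7-Zip's internal benchmark per cluster
taskset -c 0-3 7zr b   # A55 cluster only
taskset -c 4-7 7zr b   # A76 cluster only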
As such some stuff to babble about in your hidden Discord crap channels…
@willy: there are some discrepancies between tinymembench latency measurements and ramlat. The former clearly shows increased latency starting with 4M sizes; ramlat only shows a huge difference to the older benchmark (made with Radxa's OS image) on the cpu6 measurement. What am I doing wrong when calling ramlat?
I don't think you're doing anything wrong. Memory latencies are extremely dependent on the walk pattern, because latencies basically haven't changed in 30 years (they never went lower than 50-60 ns) and everything else solely depends on optimizing transfers for certain access patterns. As such, in order to perform such measurements, tools like ramlat and tinymembench need to proceed along non-predictable walk patterns that don't cost too much in CPU usage, so both are probably using different algorithms, resulting in different measurements.
However I’m very surprised by the comparison between the two images, because:
I'm wondering if it's not just that one cluster (or even one core) randomly gets punished during one of these tests. Maybe if you run tinymembench 5 times and ramlat 5 times, both will agree on seeing that one cluster is regularly much slower, and either the affected cluster will always be the same for a given tool, possibly providing a hint about an access pattern triggering the problem, or maybe it will happen on different clusters in a random fashion. If you look at tinymembench on cluster 0, it got one measurement significantly off at 32 MB. That's also a hint that there could be something not very stable during these measurements.
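Something like this rough sketch could do (core numbering assumed: cpu0 = A55, cpu6 = A76; ramlat arguments left out since I don't know which ones were used):
# 5 pinned runs per tool on one A55 and one A76 core
for cpu in 0 6; do
  for run in 1 2 3 4 5; do
    taskset -c "$cpu" ./ramlat        # add the ramlat arguments from the earlier runs here
    taskset -c "$cpu" ./tinymembench
  done
done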
I switched to performance cpufreq governor and then gave it a try:
ramlat:
Consistent results with the difference starting at 4M. With Radxa’s old/original image it looks like this:
4096k: 56.92 43.23 51.38 42.95 51.56 44.68 46.86 59.27
8192k: 99.71 85.63 98.51 85.03 97.64 86.10 84.90 91.85
16384k: 119.4 119.1 120.0 109.2 118.5 109.4 109.6 107.9
And with the freshly built Armbian image (also including recent commits from Radxa’s kernel repo) it looks like this on the A76:
4096k: 100.3 75.80 87.20 74.90 86.04 75.01 80.96 111.3
8192k: 176.2 156.5 170.9 155.1 172.0 150.6 157.8 182.2
16384k: 212.5 201.0 210.3 200.5 210.7 194.6 203.9 216.8
Most probably I've already found the culprit since Radxa enabled dmc on Aug 22, 2022:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# cat available_governors
dmc_ondemand userspace powersave performance simple_ondemand
So to get the performance back, something like what I added years ago to a now unmaintained Armbian script might be needed…
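Something along these lines (untested sketch, to be dropped into rc.local or a small systemd oneshot unit; the sysfs path is the one shown above):
# pin the DRAM controller to the performance devfreq governor at boot
if [ -w /sys/class/devfreq/dmc/governor ]; then
    echo performance >/sys/class/devfreq/dmc/governor
fi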
Now performing tinymembench runs in a similar fashion, but this will take ages. I'll update this post later.
Edit: switching back to the performance dmc governor restores the initial memory latency/performance:
4096k: 56.15 43.70 50.32 43.57 50.23 44.84 47.17 60.30
8192k: 102.2 86.42 98.15 86.13 98.31 86.02 85.37 93.44
16384k: 123.1 112.3 120.5 111.8 120.5 112.8 115.3 111.7
Meanwhile over at Armbian only the usual amount of ignorance and stupidity
Edit 2: Rather pointless now that I've discovered the reason and provided a fix/suggestion, but here are the tinymembench runs with the dmc_ondemand governor:
So the partially random behaviour as well as the poor performance is caused by the dmc governor.
While Geekbench is not a particularly good benchmark, it's popular.
Now let's compare Rock 5B with the dmc_ondemand and performance dmc governors:
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17009078
With dmc_ondemand we also see a performance drop of around 10%. DRAM is clocked at just 528 MHz instead of 2112 MHz the majority of the time:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 19:39:41 up 1 min, 1 user, load average: 0.23, 0.11, 0.04, cpu: 4%
Too busy for benchmarking: 19:39:46 up 1 min, 1 user, load average: 0.21, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:51 up 1 min, 1 user, load average: 0.20, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:56 up 1 min, 1 user, load average: 0.18, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:01 up 1 min, 1 user, load average: 0.16, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:06 up 1 min, 1 user, load average: 0.15, 0.10, 0.04, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (22 minutes elapsed).
First run:
Single-Core Score 586
Crypto Score 790
Integer Score 573
Floating Point Score 580
Multi-Core Score 2480
Crypto Score 3426
Integer Score 2364
Floating Point Score 2575
Second run:
Single-Core Score 585
Crypto Score 757
Integer Score 575
Floating Point Score 578
Multi-Core Score 2458
Crypto Score 3408
Integer Score 2322
Floating Point Score 2593
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17008732
Full results uploaded to http://ix.io/49o7.
root@rock-5b:/sys/class/devfreq/dmc# cat trans_stat
     From   :   To
            :  528000000  1068000000  1560000000  2112000000   time(ms)
*  528000000:          0           0           0         174   1336606
  1068000000:         83           0           0          88     42146
  1560000000:         31          39           0          17     13970
  2112000000:         61         132          87           0     70973
Total transition : 712
root@rock-5b:/sys/class/devfreq/dmc# echo performance >governor
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 20:03:19 up 24 min, 2 users, load average: 1.02, 2.52, 1.90, cpu: 46%
Too busy for benchmarking: 20:03:24 up 24 min, 2 users, load average: 0.94, 2.47, 1.89, cpu: 0%
Too busy for benchmarking: 20:03:29 up 25 min, 2 users, load average: 0.86, 2.43, 1.88, cpu: 0%
Too busy for benchmarking: 20:03:34 up 25 min, 2 users, load average: 0.79, 2.39, 1.87, cpu: 0%
Too busy for benchmarking: 20:03:39 up 25 min, 2 users, load average: 0.73, 2.35, 1.86, cpu: 0%
Too busy for benchmarking: 20:03:44 up 25 min, 2 users, load average: 0.67, 2.31, 1.85, cpu: 0%
Too busy for benchmarking: 20:03:49 up 25 min, 2 users, load average: 0.62, 2.27, 1.84, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (20 minutes elapsed).
First run:
Single-Core Score 669
Crypto Score 849
Integer Score 650
Floating Point Score 681
Multi-Core Score 2690
Crypto Score 3414
Integer Score 2612
Floating Point Score 2737
Second run:
Single-Core Score 669
Crypto Score 844
Integer Score 651
Floating Point Score 680
Multi-Core Score 2665
Crypto Score 3419
Integer Score 2574
Floating Point Score 2735
https://browser.geekbench.com/v5/cpu/compare/17009035?baseline=17009078
Full results uploaded to http://ix.io/49od.
And a final one. This is two Geekbench runs with the powersave dmc governor: https://browser.geekbench.com/v5/cpu/compare/17009643?baseline=17009700
Results variation is much better than with dmc_ondemand.
And this is performance and powersave compared (or in other words: clocking DRAM at 2112 MHz vs. 528 MHz all the time): https://browser.geekbench.com/v5/cpu/compare/17009700?baseline=17009078
This provides some good insights about what Geekbench is actually doing, and also how dmc_ondemand, at least with multi-threaded workloads, often results in DRAM being clocked high when needed.
For Radxa (no idea whether they're following this thread or just have a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will share performance numbers at least 10% below what RK3588 is able to deliver.
Another interesting data point would be the consumption difference between performance and the other dmc governors. Funnily enough I've got both the equipment and the knowledge to do this.
Nice, at least now there's an explanation. I remember having searched for a long time on my first RK3399: the A53s could have faster DRAM access than the A72s, but only if two of them were used simultaneously, otherwise it was much slower, similarly due to the on_demand DMC governor being too conservative. Switching it to performance only marginally increased power consumption, maybe in the order of 100 mW or so, nothing that could justify using such a setting on anything not battery-powered.
Could also be important for those guys working on GPU/VPU like @avaf or @icecream95: when trying to draw conclusions about RK3588, the DRAM clock could be important. So I hope they're aware of what has been happening below /sys/class/devfreq/dmc for the last two weeks.
Meanwhile let’s answer questions of the clueless Armbian crowd: dark beer is standard and the best I’ve ever tasted was in Ljubljana 2 decades ago when I worked there from time to time. If you’re in Bavaria try at least Reutberger and Mooser Liesl.
There are some parameters here too: /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu
Thank you so much for your very interesting feedback! It should be considered; we should use the right parameters for everyone.
I was planning to monitor this and the NPU.
Maybe create a very simple monitor (collecting the relevant info) and draw a chart as a PNG in the final step, that's the idea.
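Something like this minimal sketch (devfreq nodes taken from this thread, dmc plus the fb000000.gpu node; the NPU node will have a different path; charting left to whatever plotting tool fits):
# sample devfreq frequencies once per second into a CSV for later charting
NODES="/sys/class/devfreq/dmc /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu"
echo "timestamp,node,cur_freq_hz" >devfreq.csv
while true; do
    for node in $NODES; do
        [ -r "$node/cur_freq" ] && echo "$(date +%s),$(basename "$node"),$(cat "$node/cur_freq")" >>devfreq.csv
    done
    sleep 1
done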
@tkaiser, I am following up on your findings; if I come up with something useful I'll push it to GitHub.
I think @icecream95 started to reverse engineer the NPU, maybe he can disclose some additional info about his findings.
I haven't gotten very far with reversing the NPU; currently I'm still focusing on the GPU and its firmware.
(I’ve found that the MCU in the GPU runs at the same speed as the shader cores, so if anyone has a use for a Cortex-M7 clocked at 1 GHz which can access at least 1 GB of RAM through the GPU MMU…)
I think you mean the DMC governor, right? Before we move the PD voltage negotiation to U-Boot, we will have to enable DMC to save power, to make sure we have enough power while booting into the kernel.
> For Radxa (no idea whether they're following this thread or just have a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will share performance numbers at least 10% below what RK3588 is able to deliver.
We did not send any developer edition to any YouTubers, only to developers.
Wow, that's insane… before this, the only chip with a Cortex-M7 at such a frequency I have ever seen is NXP's i.MX RT1170.
You might get more power savings by setting CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y (spent a lot of time on this half a decade ago when still contributing to Armbian; ofc the distro then needs to switch back to schedutil or ondemand in a later stage, e.g. by configuring cpufrequtils or some radxa-tune-hardware service).
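The "switch back in a later stage" part could look like this (assuming the Debian/Ubuntu cpufrequtils package and its usual /etc/default/cpufrequtils config):
# /etc/default/cpufrequtils -- applied by the cpufrequtils init script at boot
ENABLE=true
GOVERNOR=ondemand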
Speaking of schedutil vs. ondemand and I/O performance, the choice is rather obvious: https://github.com/radxa/kernel/commit/55f540ce97a3d19330abea8a0afc0052ab2644ef#commitcomment-79484235
I/O performance sucks without either performance or ondemand combined with io_is_busy (and to be honest: Radxa's (lack of) feedback sucks too).
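And the io_is_busy part could be as simple as this sketch (assuming per-policy ondemand tunables as exposed by cpufreq-dt on recent kernels; older setups expose a single global /sys/devices/system/cpu/cpufreq/ondemand/ directory instead):
# with ondemand active, let iowait count as busy time so I/O load ramps the clocks up
for knob in /sys/devices/system/cpu/cpufreq/policy*/ondemand/io_is_busy; do
    [ -w "$knob" ] && echo 1 >"$knob"
done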
My older script code rotting unmaintained in some Armbian service won't do it any more (and they won't change anything about it since they don't give a sh*t about low-level optimisations).
Yeah, but by the time you send out review samples IMHO you should've fixed the performance issues. Both of them.
Currently the only other RK3588(S) vendor affected by trashed memory performance is Firefly. All the others have not discovered/enabled the dmc device-tree node so far.