Nope, this is a long debunked myth originating from FreeNAS/TrueNAS forums: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
Quoting one of the ZFS designers, Matthew Ahrens:
Yeah, there’s nothing about ZFS that mandates ECC memory any more than anything else. I also didn’t say anything about using ZFS on Rock5 though it would probably work quite well, even if there wouldn’t be any support for the hardware decompression engine.
NVMe performance is not “really bad” either - I’ve been running my main buildbox off ZFS-on-NVMe for about a year and a half now, it’s pretty ridiculously fast. I went from a full OpenWrt image build taking well over an hour (on 4x250GB SATA SSD in RAID 0) to 25 minutes (those same 4x250GB SSDs as single-disk ZFS vdevs) to less than 10 minutes on a single-drive NVMe ZFS pool. And it’s not even a particularly good NVMe drive… Sure, most of the boost there comes from ARC, but still.
ANYWAY. This is off-topic. Point being, once I get my hands on a Rock5 I'll be sure to give the H10 a try.
Right now mdadm gives me 12 GB/s while ZFS manages merely 3-4 GB/s (7 NVMe drives). Not that it matters with the Rock5, but in terms of total NVMe throughput ZFS performance is terrible.
And yes, that’s going off-topic.
To whom it may concern… @lanefu @Tonymac32 ?
| Distro | Clockspeed | Kernel | 7-zip (MIPS) | AES-256 16 KB (kB/s) | memcpy (MB/s) | memset (MB/s) |
|---|---|---|---|---|---|---|
| Radxa Focal | 2350/1830 MHz | 5.10 | 16450 | 1337540 | 10830 | 29220 |
| Armbian Focal | 2350/1830 MHz | 5.10 | 14670 | 1339400 | 10430 | 29130 |
An 11% drop in 7-zip MIPS (a benchmark that is sensitive to memory latency).
The Armbian image was freshly built 2 days ago, the Radxa image not updated wrt bootloader/kernel. On the Armbian image, memory latency and bandwidth of the A55 cluster are completely trashed.
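For anyone who wants to double-check the 7-zip numbers independently of sbc-bench: 7-Zip's built-in benchmark reports a comparable MIPS rating and can be pinned per cluster (core numbering below is assumed to be the usual RK3588 layout, cpu0-3 = A55, cpu4-7 = A76):
# quick sketch: compare 7-Zip's internal benchmark per cluster
taskset -c 0-3 7zr b   # A55 cluster only
taskset -c 4-7 7zr b   # A76 cluster only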
As such some stuff to babble about in your hidden Discord crap channels…
@willy: there are some discrepancies between tinymembench latency measurements and ramlat. The former clearly shows increased latency starting with 4M sizes; ramlat only shows a huge difference to the older benchmark (made with Radxa's OS image) on the cpu6 measurement. What am I doing wrong when calling ramlat?
I don't think you're doing anything wrong. Memory latencies are extremely dependent on the walk pattern, because latencies basically haven't changed in 30 years (they never went lower than 50-60 ns) and everything else solely depends on optimizing transfers for certain access patterns. As such, in order to perform such measurements, tools like ramlat and tinymembench need to proceed along non-predictable walk patterns that don't cost too much in CPU usage, so both are probably using different algorithms, resulting in different measurements.
However I’m very surprised by the comparison between the two images, because:
I'm wondering if it's not just that one cluster (or even one core) randomly gets punished during one of these tests. Maybe if you run tinymembench 5 times and ramlat 5 times, both will agree on seeing that one cluster is regularly much slower, and either the affected cluster will always be the same for a given tool, possibly providing a hint about an access pattern triggering the problem, or maybe it will happen on different clusters in a random fashion. If you look at tinymembench on cluster 0, it got one measurement significantly off at 32 MB. That's also a hint that there could be something not very stable during these measurements.
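Something like this rough sketch could do (core numbering assumed: cpu0 = A55, cpu6 = A76; ramlat arguments left out since I don't know which ones were used):
# 5 pinned runs per tool on one A55 and one A76 core
for cpu in 0 6; do
  for run in 1 2 3 4 5; do
    taskset -c "$cpu" ./ramlat        # add the ramlat arguments from the earlier runs here
    taskset -c "$cpu" ./tinymembench
  done
done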
I switched to performance cpufreq governor and then gave it a try:
ramlat:
Consistent results with the difference starting at 4M. With Radxa’s old/original image it looks like this:
4096k: 56.92 43.23 51.38 42.95 51.56 44.68 46.86 59.27
8192k: 99.71 85.63 98.51 85.03 97.64 86.10 84.90 91.85
16384k: 119.4 119.1 120.0 109.2 118.5 109.4 109.6 107.9
And with the freshly built Armbian image (also including recent commits from Radxa’s kernel repo) it looks like this on the A76:
4096k: 100.3 75.80 87.20 74.90 86.04 75.01 80.96 111.3
8192k: 176.2 156.5 170.9 155.1 172.0 150.6 157.8 182.2
16384k: 212.5 201.0 210.3 200.5 210.7 194.6 203.9 216.8
Most probably I've already found the culprit since Radxa enabled dmc on Aug 22, 2022:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# cat available_governors
dmc_ondemand userspace powersave performance simple_ondemand
So to get the performance back, something like what I added years ago to a now unmaintained Armbian script might be needed…
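Something along these lines (untested sketch, to be dropped into rc.local or a small systemd oneshot unit; the sysfs path is the one shown above):
# pin the DRAM controller to the performance devfreq governor at boot
if [ -w /sys/class/devfreq/dmc/governor ]; then
    echo performance >/sys/class/devfreq/dmc/governor
fi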
Now performing tinymembench runs in a similar fashion, but this will take ages. I'll update this post later.
Edit: switching back to the performance dmc governor restores the initial memory latency/performance:
4096k: 56.15 43.70 50.32 43.57 50.23 44.84 47.17 60.30
8192k: 102.2 86.42 98.15 86.13 98.31 86.02 85.37 93.44
16384k: 123.1 112.3 120.5 111.8 120.5 112.8 115.3 111.7
Meanwhile over at Armbian only the usual amount of ignorance and stupidity
Edit 2: Rather pointless now that I've discovered the reason and provided a fix/suggestion, but here are the tinymembench runs with the dmc_ondemand governor:
So the partially random behaviour as well as the poor performance is caused by the dmc governor.
While Geekbench is not a particularly good benchmark, it's popular.
Now let's compare Rock 5B with the dmc_ondemand and performance dmc governors:
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17009078
With dmc_ondemand we also see a performance drop of around 10%. DRAM is clocked at just 528 MHz instead of 2112 MHz the majority of the time:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 19:39:41 up 1 min, 1 user, load average: 0.23, 0.11, 0.04, cpu: 4%
Too busy for benchmarking: 19:39:46 up 1 min, 1 user, load average: 0.21, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:51 up 1 min, 1 user, load average: 0.20, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:56 up 1 min, 1 user, load average: 0.18, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:01 up 1 min, 1 user, load average: 0.16, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:06 up 1 min, 1 user, load average: 0.15, 0.10, 0.04, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (22 minutes elapsed).
First run:
Single-Core Score 586
Crypto Score 790
Integer Score 573
Floating Point Score 580
Multi-Core Score 2480
Crypto Score 3426
Integer Score 2364
Floating Point Score 2575
Second run:
Single-Core Score 585
Crypto Score 757
Integer Score 575
Floating Point Score 578
Multi-Core Score 2458
Crypto Score 3408
Integer Score 2322
Floating Point Score 2593
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17008732
Full results uploaded to http://ix.io/49o7.
root@rock-5b:/sys/class/devfreq/dmc# cat trans_stat
     From   :   To
            :  528000000  1068000000  1560000000  2112000000   time(ms)
*  528000000:          0           0           0         174   1336606
  1068000000:         83           0           0          88     42146
  1560000000:         31          39           0          17     13970
  2112000000:         61         132          87           0     70973
Total transition : 712
root@rock-5b:/sys/class/devfreq/dmc# echo performance >governor
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 20:03:19 up 24 min, 2 users, load average: 1.02, 2.52, 1.90, cpu: 46%
Too busy for benchmarking: 20:03:24 up 24 min, 2 users, load average: 0.94, 2.47, 1.89, cpu: 0%
Too busy for benchmarking: 20:03:29 up 25 min, 2 users, load average: 0.86, 2.43, 1.88, cpu: 0%
Too busy for benchmarking: 20:03:34 up 25 min, 2 users, load average: 0.79, 2.39, 1.87, cpu: 0%
Too busy for benchmarking: 20:03:39 up 25 min, 2 users, load average: 0.73, 2.35, 1.86, cpu: 0%
Too busy for benchmarking: 20:03:44 up 25 min, 2 users, load average: 0.67, 2.31, 1.85, cpu: 0%
Too busy for benchmarking: 20:03:49 up 25 min, 2 users, load average: 0.62, 2.27, 1.84, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (20 minutes elapsed).
First run:
Single-Core Score 669
Crypto Score 849
Integer Score 650
Floating Point Score 681
Multi-Core Score 2690
Crypto Score 3414
Integer Score 2612
Floating Point Score 2737
Second run:
Single-Core Score 669
Crypto Score 844
Integer Score 651
Floating Point Score 680
Multi-Core Score 2665
Crypto Score 3419
Integer Score 2574
Floating Point Score 2735
https://browser.geekbench.com/v5/cpu/compare/17009035?baseline=17009078
Full results uploaded to http://ix.io/49od.
And a final one. This is two Geekbench runs with the powersave dmc governor: https://browser.geekbench.com/v5/cpu/compare/17009643?baseline=17009700
Results variation is much better than with dmc_ondemand.
And this is performance and powersave compared (or in other words: clocking DRAM at 2112 MHz vs. 528 MHz all the time): https://browser.geekbench.com/v5/cpu/compare/17009700?baseline=17009078
This provides some good insights about what Geekbench is actually doing, and also how dmc_ondemand, at least with multi-threaded workloads, often results in DRAM being clocked high when needed.
For Radxa (no idea whether they're following this thread or just have a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will share performance numbers at least 10% below what RK3588 is able to deliver.
Another interesting data point would be the consumption difference between performance and the other dmc governors. Funnily enough I've got both the equipment and the knowledge to do this.
Nice, at least now there's an explanation. I remember having searched for a long time on my first RK3399: the A53s could have faster DRAM access than the A72s, but only if two of them were used simultaneously, otherwise it was much slower, similarly due to the on_demand DMC governor being too conservative. Switching it to performance only marginally increased power consumption, maybe in the order of 100 mW or so, nothing that could justify using such a setting on anything not battery-powered.
Could also be important for those guys working on GPU/VPU like @avaf or @icecream95: when trying to draw conclusions about RK3588, the DRAM clock could be important. So I hope they're aware of what has been happening below /sys/class/devfreq/dmc for the last two weeks.
Meanwhile let’s answer questions of the clueless Armbian crowd: dark beer is standard and the best I’ve ever tasted was in Ljubljana 2 decades ago when I worked there from time to time. If you’re in Bavaria try at least Reutberger and Mooser Liesl.
There are some parameters here too: /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu
Thank you so much for your very interesting feedback! It should be considered; we should use the right parameters for everyone.
I was planning to monitor this and the NPU.
Maybe create a very simple monitor (collecting the relevant info) and draw a chart as a PNG in the final step, that's the idea.
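Something like this minimal sketch (devfreq nodes taken from this thread, dmc plus the fb000000.gpu node; the NPU node will have a different path; charting left to whatever plotting tool fits):
# sample devfreq frequencies once per second into a CSV for later charting
NODES="/sys/class/devfreq/dmc /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu"
echo "timestamp,node,cur_freq_hz" >devfreq.csv
while true; do
    for node in $NODES; do
        [ -r "$node/cur_freq" ] && echo "$(date +%s),$(basename "$node"),$(cat "$node/cur_freq")" >>devfreq.csv
    done
    sleep 1
done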
@tkaiser, I am following up on your findings; if I come up with something useful I'll push it to GitHub.
I think @icecream95 started to reverse engineer the NPU, maybe he can disclose some additional info about his findings.
I haven't gotten very far with reversing the NPU; currently I'm still focusing on the GPU and its firmware.
(I’ve found that the MCU in the GPU runs at the same speed as the shader cores, so if anyone has a use for a Cortex-M7 clocked at 1 GHz which can access at least 1 GB of RAM through the GPU MMU…)
I think you mean the DMC governor, right? Before we move the PD voltage negotiation to U-Boot, we will have to enable DMC to save power, to make sure we have enough power while booting into the kernel.
> For Radxa (no idea whether they're following this thread or just have a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will share performance numbers at least 10% below what RK3588 is able to deliver.
We did not send any developer edition to any YouTubers, only to developers.
Wow, that's insane… before this, the only chip with a Cortex-M7 at such a frequency I have ever seen is NXP's i.MX RT1170.
You might get more power savings by setting CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y (spent a lot of time on this half a decade ago when still contributing to Armbian; ofc the distro then needs to switch back to schedutil or ondemand in a later stage, e.g. by configuring cpufrequtils or some radxa-tune-hardware service).
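The "switch back in a later stage" part could look like this (assuming the Debian/Ubuntu cpufrequtils package and its usual /etc/default/cpufrequtils config):
# /etc/default/cpufrequtils -- applied by the cpufrequtils init script at boot
ENABLE=true
GOVERNOR=ondemand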
Speaking of schedutil vs. ondemand and I/O performance, the choice is rather obvious: https://github.com/radxa/kernel/commit/55f540ce97a3d19330abea8a0afc0052ab2644ef#commitcomment-79484235
I/O performance sucks without either performance or ondemand combined with io_is_busy (and to be honest: Radxa's (lack of) feedback sucks too).
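And the io_is_busy part could be as simple as this sketch (assuming per-policy ondemand tunables as exposed by cpufreq-dt on recent kernels; older setups expose a single global /sys/devices/system/cpu/cpufreq/ondemand/ directory instead):
# with ondemand active, let iowait count as busy time so I/O load ramps the clocks up
for knob in /sys/devices/system/cpu/cpufreq/policy*/ondemand/io_is_busy; do
    [ -w "$knob" ] && echo 1 >"$knob"
done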
My older script code rotting unmaintained in some Armbian service won't do it any more (and they won't change anything about it since they don't give a sh*t about low-level optimisations).
Yeah, but by the time you send out review samples IMHO you should've fixed the performance issues. Both of them.
Currently the only other RK3588(S) vendor affected by trashed memory performance is Firefly. All the others have not discovered/enabled the dmc device-tree node so far.