Ooh, I wonder if the Optane H10 (which uses x2x2 bifurcation) would work…
ROCK 5B Debug Party Invitation
Why ask the Radxa team to support an obsolete Intel technology that only works on Microsoft Windows?
It is neither obsolete nor Intel-proprietary. You’re thinking of Intel’s “Optane Memory” consumer-focused software product, which is just a fairly mediocre implementation of a tiered cache for an SSD.
The Optane H10 drive is just a 16/32GB PCIe 3.0 x2 Optane SSD and a 512/1024/2048GB QLC NAND SSD on a single card. It doesn’t have to be used with Intel’s shitty software - you can put it in whatever system (as long as it supports x2x2 bifurcation on the M.2 slot) and use the two drives as independent storage volumes.
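For what it’s worth, a quick sanity check for whether x2x2 bifurcation actually worked is whether both halves of the H10 enumerate as independent NVMe controllers. A minimal sketch (the helper name and the sample lspci lines below are illustrative, not captured from a real board):

```shell
# Hypothetical helper: count NVMe controllers in `lspci` output. With working
# x2x2 bifurcation an H10 should show up as two of them; without it, only one
# half (or nothing) appears.
count_nvme_controllers() {
  grep -ci 'non-volatile memory controller'
}

# On a live system: lspci | count_nvme_controllers   (expect 2 for an H10)
```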
They work very well as a combined ZIL SLOG (Optane) and L2ARC (NAND) drive in a 2x10Gbps ZFS storage server/NAS - Optane is nearly unmatched when it comes to ZIL SLOG performance, only RAM-based drives beat it - but of course Intel never bothered to market it that way.
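A sketch of that setup, assuming the pool is called `tank` and that nvme0n1 turned out to be the Optane half (pool and device names are placeholders; which node is which depends on enumeration order):

```shell
# Assumed layout: nvme0n1 = 32GB Optane half, nvme1n1 = 512GB QLC NAND half.
# Add the Optane as SLOG (separate intent log) and the NAND as L2ARC cache:
zpool add tank log /dev/nvme0n1
zpool add tank cache /dev/nvme1n1

# Both devices should now appear under "logs" and "cache" respectively:
zpool status tank
```

These commands require an existing ZFS pool and root privileges, so they are shown as a config sketch rather than something runnable in isolation.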
Off-topic:
Intel barely even bothered to market Optane at all - it’s not an inherently bad technology, it has many major advantages over NAND flash that do actually make it worth the money, but the only thing they really bothered to market was “Optane Memory”, which isn’t even really a thing - it’s just Intel’s Rapid Storage Technology with an Optane cache drive and a brand name - and is atrocious. But that’s what people think of when you say Optane. Stupid.
Then they went all-in on Optane DCPMM (Optane chips on a DDR4 DIMM, and yes, it’s fast enough to almost keep up with DDR4), seeing it as a chance to get customers hooked on something that AMD couldn’t provide - but it’s an incredibly niche technology that’s not very useful outside of a few very specific use cases, and their decision to focus on it is what led to Micron quitting the 3D XPoint (generic term for Optane) joint venture.
And now they’ve dropped 3D XPoint entirely just before PCIe 5.0 and CXL 2.0 would’ve given it a new lease on life. Another genuinely innovative, highly promising, and potentially revolutionary technology driven into the dirt by Intel’s obsession with finding ways to lock customers into their platforms, rather than just producing a product good enough that nobody wants to go with anyone else.
Anyway yeah that’s wildly off-topic, but the tl;dr is that “Optane Memory” is just shitty software, it is not representative of Optane as a whole, and an SBC with a 32/512 H10 drive in it could be good for quite a few things.
While using the Rock5 as ZFS storage may be an interesting idea (which I’m gonna test when I get the necessary parts) - don’t forget that you don’t have ECC memory, which kinda defeats the point of ZFS (unless you are going for compression & dedup). And without the O_direct patch (3.0) - NVMe performance is really bad
Nope, this is a long debunked myth originating from FreeNAS/TrueNAS forums: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
Quoting one of the ZFS designers, Matthew Ahrens:
Yeah, there’s nothing about ZFS that mandates ECC memory any more than anything else. I also didn’t say anything about using ZFS on Rock5 though it would probably work quite well, even if there wouldn’t be any support for the hardware decompression engine.
NVMe performance is not “really bad” either - I’ve been running my main buildbox off ZFS-on-NVMe for about a year and a half now, it’s pretty ridiculously fast. I went from a full OpenWrt image build taking well over an hour (on 4x250GB SATA SSD in RAID 0) to 25 minutes (those same 4x250GB SSDs as single-disk ZFS vdevs) to less than 10 minutes on a single-drive NVMe ZFS pool. And it’s not even a particularly good NVMe drive… Sure, most of the boost there comes from ARC, but still.
ANYWAY. This is off-topic. Point being, once I get my hands on a Rock5 i’ll be sure to give the H10 a try
Right now mdadm gives me 12 GB/s, while ZFS merely 3-4 GB/s (7 NVMe drives). Not like it matters with the Rock5, but overall NVMe performance is terrible.
And yes, that’s going off-topic.
To whom it may concern… @lanefu @Tonymac32 ?
| Distro | Clockspeed | Kernel | 7-zip | AES-256 (16 KB) | memcpy | memset |
|---|---|---|---|---|---|---|
| Radxa Focal | 2350/1830 MHz | 5.10 | 16450 | 1337540 | 10830 | 29220 |
| Armbian Focal | 2350/1830 MHz | 5.10 | 14670 | 1339400 | 10430 | 29130 |
An 11% drop in 7-zip MIPS (which is sensitive to memory latency).
Armbian image freshly built 2 days ago, Radxa image not updated wrt bootloader/kernel. On the Armbian image, memory latency and bandwidth of the A55 cluster are completely trashed.
As such some stuff to babble about in your hidden Discord crap channels…
@willy: there are some discrepancies between tinymembench latency measurements and ramlat. The former clearly shows increased latency starting with 4M sizes, while ramlat only shows a huge difference to the older benchmark made with Radxa’s OS image on the cpu6 measurement. What am I doing wrong when calling ramlat?
I don’t think you’re doing anything wrong. Memory latency measurements are extremely dependent on the walk pattern, because raw latencies basically haven’t changed in 30 years (they never went below 50-60 ns), and everything else depends solely on optimizing transfers for certain access patterns. As such, in order to perform such measurements, tools like ramlat and tinymembench need to proceed along non-predictable walk patterns that don’t cost too much in CPU usage, so both are probably using different algorithms, resulting in different measurements.
However I’m very surprised by the comparison between the two images, because:
- tinymembench sees CPU4 twice as slow on focal
- ramlat sees CPU6 twice as slow on focal
I’m wondering if it’s not just that one cluster (or even one core) randomly gets punished during one of these tests. Maybe if you run tinymembench 5 times and ramlat 5 times, both will agree on seeing that one cluster is regularly much slower. Either the affected cluster will always be the same for a given tool, possibly providing a hint about an access pattern triggering the problem, or maybe it will happen on different clusters in a random fashion. If you look at tinymembench on cluster 0, it got one measurement significantly off at 32 MB. That’s also a hint that there could be something not very stable during these measurements.
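The suggested repeated runs could be scripted along these lines (the ramlat path matches the invocations used later in this thread; the core list assumes the RK3588’s one A55 plus two A76 cluster layout, with one representative core per cluster):

```shell
# Pin ramlat to one core per cluster and repeat 5 times, so it becomes visible
# whether the slow cluster is stable per tool or moves around randomly.
for core in 0 4 6; do
  for run in 1 2 3 4 5; do
    echo "=== cpu$core run $run ==="
    taskset -c "$core" /usr/local/src/ramspeed/ramlat -s -n 200
  done
done
```

The same loop works for tinymembench by swapping the binary, though as noted below that takes far longer per run.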
I switched to performance cpufreq governor and then gave it a try:
ramlat:
- taskset -c 0 /usr/local/src/ramspeed/ramlat -s -n 200
- taskset -c 4 /usr/local/src/ramspeed/ramlat -s -n 200
- taskset -c 6 /usr/local/src/ramspeed/ramlat -s -n 200
Consistent results with the difference starting at 4M. With Radxa’s old/original image it looks like this:
4096k: 56.92 43.23 51.38 42.95 51.56 44.68 46.86 59.27
8192k: 99.71 85.63 98.51 85.03 97.64 86.10 84.90 91.85
16384k: 119.4 119.1 120.0 109.2 118.5 109.4 109.6 107.9
And with the freshly built Armbian image (also including recent commits from Radxa’s kernel repo) it looks like this on the A76:
4096k: 100.3 75.80 87.20 74.90 86.04 75.01 80.96 111.3
8192k: 176.2 156.5 170.9 155.1 172.0 150.6 157.8 182.2
16384k: 212.5 201.0 210.3 200.5 210.7 194.6 203.9 216.8
Most probably I already found the culprit since Radxa enabled dmc on Aug 22, 2022:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# cat available_governors
dmc_ondemand userspace powersave performance simple_ondemand
So to get performance back, something like what I added years ago to a now unmaintained Armbian script might be needed…
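A minimal sketch of such a tweak (the function name is made up; the default node is the RK3588 DMC path shown above, parameterized here so the helper can be exercised against any writable directory):

```shell
# Set a devfreq governor and echo back what the kernel actually accepted.
# On a Rock 5B, as root: set_devfreq_governor performance
set_devfreq_governor() {
  gov="$1"
  node="${2:-/sys/class/devfreq/dmc}"
  echo "$gov" > "$node/governor" && cat "$node/governor"
}
```

On a real board this would typically be hooked into an init script or udev rule so the setting survives reboots.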
Now performing tinymembench runs in a similar fashion but this will take ages. I’ll update this post later.
Edit: switching back to the performance dmc governor restores initial memory latency / performance:
4096k: 56.15 43.70 50.32 43.57 50.23 44.84 47.17 60.30
8192k: 102.2 86.42 98.15 86.13 98.31 86.02 85.37 93.44
16384k: 123.1 112.3 120.5 111.8 120.5 112.8 115.3 111.7
Meanwhile over at Armbian only the usual amount of ignorance and stupidity
Edit 2: Rather pointless now that I’ve discovered the reason and provided a fix/suggestion, but here are the tinymembench runs with the dmc_ondemand governor:
- taskset -c 0 /usr/local/src/tinymembench/tinymembench (last run differs a little)
- taskset -c 4 /usr/local/src/tinymembench/tinymembench (all runs slow except the 3rd)
- taskset -c 6 /usr/local/src/tinymembench/tinymembench (all runs slow)
So the partially random behaviour as well as the poor performance are caused by the dmc governor.
While Geekbench is not a particularly good benchmark, it’s popular.
Now let’s compare Rock 5B with the dmc_ondemand and performance dmc governors:
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17009078
With dmc_ondemand we also see a roughly 10% performance drop. DRAM is clocked at just 528 MHz for the majority of the time instead of 2112 MHz:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 19:39:41 up 1 min, 1 user, load average: 0.23, 0.11, 0.04, cpu: 4%
Too busy for benchmarking: 19:39:46 up 1 min, 1 user, load average: 0.21, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:51 up 1 min, 1 user, load average: 0.20, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:56 up 1 min, 1 user, load average: 0.18, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:01 up 1 min, 1 user, load average: 0.16, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:06 up 1 min, 1 user, load average: 0.15, 0.10, 0.04, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (22 minutes elapsed).
First run:
Single-Core Score 586
Crypto Score 790
Integer Score 573
Floating Point Score 580
Multi-Core Score 2480
Crypto Score 3426
Integer Score 2364
Floating Point Score 2575
Second run:
Single-Core Score 585
Crypto Score 757
Integer Score 575
Floating Point Score 578
Multi-Core Score 2458
Crypto Score 3408
Integer Score 2322
Floating Point Score 2593
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17008732
Full results uploaded to http://ix.io/49o7.
root@rock-5b:/sys/class/devfreq/dmc# cat trans_stat
     From   :   To
            :  528000000 1068000000 1560000000 2112000000   time(ms)
*  528000000:          0          0          0        174   1336606
  1068000000:         83          0          0         88     42146
  1560000000:         31         39          0         17     13970
  2112000000:         61        132         87          0     70973
Total transition : 712
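Those time(ms) figures mean the DMC sat at 528 MHz for roughly 91% of the run. A small helper to compute that from trans_stat (the function name and output format are my own; it only assumes the last column of each frequency row is time in ms):

```shell
# Read trans_stat-style rows from stdin and print per-frequency residency.
dmc_residency() {
  awk '/^ *\*? *[0-9]+:/ {
         gsub(/\*/, "")          # drop the current-frequency marker
         sub(/:/, "", $1)        # strip the colon from the frequency field
         time[$1] = $NF          # last column: time spent at this freq (ms)
         total += $NF
       }
       END { for (f in time) printf "%s %.1f%%\n", f, 100 * time[f] / total }'
}

# e.g.: dmc_residency < /sys/class/devfreq/dmc/trans_stat
```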
root@rock-5b:/sys/class/devfreq/dmc# echo performance >governor
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 20:03:19 up 24 min, 2 users, load average: 1.02, 2.52, 1.90, cpu: 46%
Too busy for benchmarking: 20:03:24 up 24 min, 2 users, load average: 0.94, 2.47, 1.89, cpu: 0%
Too busy for benchmarking: 20:03:29 up 25 min, 2 users, load average: 0.86, 2.43, 1.88, cpu: 0%
Too busy for benchmarking: 20:03:34 up 25 min, 2 users, load average: 0.79, 2.39, 1.87, cpu: 0%
Too busy for benchmarking: 20:03:39 up 25 min, 2 users, load average: 0.73, 2.35, 1.86, cpu: 0%
Too busy for benchmarking: 20:03:44 up 25 min, 2 users, load average: 0.67, 2.31, 1.85, cpu: 0%
Too busy for benchmarking: 20:03:49 up 25 min, 2 users, load average: 0.62, 2.27, 1.84, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (20 minutes elapsed).
First run:
Single-Core Score 669
Crypto Score 849
Integer Score 650
Floating Point Score 681
Multi-Core Score 2690
Crypto Score 3414
Integer Score 2612
Floating Point Score 2737
Second run:
Single-Core Score 669
Crypto Score 844
Integer Score 651
Floating Point Score 680
Multi-Core Score 2665
Crypto Score 3419
Integer Score 2574
Floating Point Score 2735
https://browser.geekbench.com/v5/cpu/compare/17009035?baseline=17009078
Full results uploaded to http://ix.io/49od.
And a final one. This is two times Geekbench with the powersave dmc governor: https://browser.geekbench.com/v5/cpu/compare/17009643?baseline=17009700
Results variation is much better than with dmc_ondemand.
And this is performance and powersave compared (or in other words: clocking DRAM at 2112 MHz vs. 528 MHz all the time): https://browser.geekbench.com/v5/cpu/compare/17009700?baseline=17009078
Provides some good insights about what Geekbench is actually doing, and also how dmc_ondemand, at least with multi-threaded workloads, often results in DRAM being clocked high when needed.
For Radxa (no idea whether they’re following this thread or just having a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will otherwise share performance numbers at least 10% below what the RK3588 is able to deliver.
Another interesting data point would be the consumption difference between performance and the other dmc governors. Funnily enough I have both the equipment and the knowledge to do this
Nice, at least now there’s an explanation. I remember having searched for a long time on my first 3399. A53s could have faster DRAM access than A72s, but only if two of them were used simultaneously, otherwise it was much slower - similarly due to an ondemand DMC governor that was too conservative. Switching it to performance only marginally increased power consumption, maybe on the order of 100 mW or so, nothing that would justify the conservative setting on anything not battery-powered.
Could also be important for those guys working on GPU/VPU like @avaf or @icecream95 - when trying to draw conclusions about the RK3588, the DRAM clock could be important. So I hope they’re aware of what’s been happening below /sys/class/devfreq/dmc for the past 2 weeks.
Meanwhile let’s answer questions of the clueless Armbian crowd: dark beer is standard and the best I’ve ever tasted was in Ljubljana 2 decades ago when I worked there from time to time. If you’re in Bavaria try at least Reutberger and Mooser Liesl.
There are some parameters here /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu too.
Thank you so much for your very interesting feedback! It should be considered - we should use the good parameters for everyone.
I was planning to monitor this and the NPU.
Maybe create a very simple monitor (collect relevant info) and draw a chart as a PNG at the final step - that’s the idea.
@tkaiser, I am following up on your findings. If I come up with something useful I’ll push it to GitHub.
I think @icecream95 started to reverse engineer the NPU; maybe he can disclose some additional info about his findings.
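The monitor idea could start as small as this (everything here is a sketch: the sampler name is made up, the node paths are the ones mentioned in this thread, and the chart/PNG step is left out):

```shell
# Emit one CSV line (epoch,node,cur_freq) per devfreq node passed in. Run it
# in a loop with sleep to collect a time series for later charting.
sample_devfreq() {
  for node in "$@"; do
    printf '%s,%s,%s\n' "$(date +%s)" "$node" "$(cat "$node/cur_freq")"
  done
}

# e.g.: while true; do
#         sample_devfreq /sys/class/devfreq/dmc \
#           /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu
#         sleep 1
#       done >> devfreq.csv
```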