Ooh, I wonder if the Optane H10 (which uses x2x2 bifurcation) would work…
ROCK 5B Debug Party Invitation
Why ask the Radxa team to support an obsolete Intel technology that only works on Microsoft Windows?
It is neither obsolete nor Intel-proprietary. You’re thinking of Intel’s “Optane Memory” consumer-focused software product, which is just a fairly mediocre implementation of a tiered cache for an SSD.
The Optane H10 drive is just a 16/32GB PCIe 3.0 x2 Optane SSD and a 512/1024/2048GB QLC NAND SSD on a single card. It doesn’t have to be used with Intel’s shitty software - you can put it in whatever system (as long as it supports x2x2 bifurcation on the M.2 slot) and use the two drives as independent storage volumes.
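For what it’s worth, a quick sanity check for whether x2x2 bifurcation actually worked is whether both halves of the H10 enumerate as independent NVMe controllers. A minimal sketch (the helper name and the sample lspci lines below are illustrative, not captured from a real board):

```shell
# Hypothetical helper: count NVMe controllers in `lspci` output. With working
# x2x2 bifurcation an H10 should show up as two of them; without it, only one
# half (or nothing) appears.
count_nvme_controllers() {
  grep -ci 'non-volatile memory controller'
}

# On a live system: lspci | count_nvme_controllers   (expect 2 for an H10)
```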
They work very well as a combined ZIL SLOG (Optane) and L2ARC (NAND) drive in a 2x10Gbps ZFS storage server/NAS - Optane is nearly unmatched when it comes to ZIL SLOG performance, only RAM-based drives beat it - but of course Intel never bothered to market it that way.
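A sketch of that setup, assuming the pool is called `tank` and that nvme0n1 turned out to be the Optane half (pool and device names are placeholders; which node is which depends on enumeration order):

```shell
# Assumed layout: nvme0n1 = 32GB Optane half, nvme1n1 = 512GB QLC NAND half.
# Add the Optane as SLOG (separate intent log) and the NAND as L2ARC cache:
zpool add tank log /dev/nvme0n1
zpool add tank cache /dev/nvme1n1

# Both devices should now appear under "logs" and "cache" respectively:
zpool status tank
```

These commands require an existing ZFS pool and root privileges, so they are shown as a config sketch rather than something runnable in isolation.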
Off-topic:
Intel barely even bothered to market Optane at all - it’s not an inherently bad technology, it has many major advantages over NAND flash that do actually make it worth the money, but the only thing they really bothered to market was “Optane Memory”, which isn’t even really a thing - it’s just Intel’s Rapid Storage Technology with an Optane cache drive and a brand name - and is atrocious. But that’s what people think of when you say Optane. Stupid.
Then they went all-in on Optane DCPMM (Optane chips on a DDR4 DIMM, and yes, it’s fast enough to almost keep up with DDR4), seeing it as a chance to get customers hooked on something that AMD couldn’t provide - but it’s an incredibly niche technology that’s not very useful outside of a few very specific use cases, and their decision to focus on it is what led to Micron quitting the 3D XPoint (generic term for Optane) joint venture.
And now they’ve dropped 3D XPoint entirely just before PCIe 5.0 and CXL 2.0 would’ve given it a new lease on life. Another genuinely innovative, highly promising, and potentially revolutionary technology driven into the dirt by Intel’s obsession with finding ways to lock customers into their platforms, rather than just producing a product good enough that nobody wants to go with anyone else.
Anyway yeah that’s wildly off-topic, but the tl;dr is that “Optane Memory” is just shitty software, it is not representative of Optane as a whole, and an SBC with a 32/512 H10 drive in it could be good for quite a few things.
While using the Rock5 as ZFS storage may be an interesting idea (which I’m gonna test when I get the necessary parts) - don’t forget that you don’t have ECC memory, which kinda defeats the point of ZFS (unless you are going for compression & dedup). And without the O_direct patch (3.0) - NVMe performance is really bad
Nope, this is a long debunked myth originating from FreeNAS/TrueNAS forums: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
Quoting one of the ZFS designers, Matthew Ahrens:
Yeah, there’s nothing about ZFS that mandates ECC memory any more than anything else. I also didn’t say anything about using ZFS on Rock5 though it would probably work quite well, even if there wouldn’t be any support for the hardware decompression engine.
NVMe performance is not “really bad” either - I’ve been running my main buildbox off ZFS-on-NVMe for about a year and a half now, it’s pretty ridiculously fast. I went from a full OpenWrt image build taking well over an hour (on 4x250GB SATA SSD in RAID 0) to 25 minutes (those same 4x250GB SSDs as single-disk ZFS vdevs) to less than 10 minutes on a single-drive NVMe ZFS pool. And it’s not even a particularly good NVMe drive… Sure, most of the boost there comes from ARC, but still.
ANYWAY. This is off-topic. Point being, once I get my hands on a Rock5 i’ll be sure to give the H10 a try
Right now mdadm gives me 12 GB/s, while ZFS merely 3-4 GB/s (7 NVMe drives). Not like it matters with the Rock5, but overall NVMe performance is terrible.
And yes, that’s going off-topic.
To whom it may concern… @lanefu @Tonymac32 ?
| Distro | Clockspeed | Kernel | 7-zip | AES-256 (16 KB) | memcpy | memset |
|---|---|---|---|---|---|---|
| Radxa Focal | 2350/1830 MHz | 5.10 | 16450 | 1337540 | 10830 | 29220 |
| Armbian Focal | 2350/1830 MHz | 5.10 | 14670 | 1339400 | 10430 | 29130 |
An 11% drop in 7-zip MIPS (which is sensitive to memory latency).
Armbian image freshly built 2 days ago, Radxa image not updated wrt bootloader/kernel. On the Armbian image, memory latency and bandwidth of the A55 cluster are completely trashed.
As such some stuff to babble about in your hidden Discord crap channels…
@willy: there are some discrepancies between tinymembench latency measurements and ramlat. The former clearly shows increased latency starting with 4M sizes, while ramlat only shows a huge difference to the older benchmark made with Radxa’s OS image on the cpu6 measurement. What am I doing wrong when calling ramlat?
I don’t think you’re doing anything wrong. Memory latency measurements are extremely dependent on the walk pattern, because raw latencies basically haven’t changed in 30 years (they never went below 50-60 ns), and everything else depends solely on optimizing transfers for certain access patterns. As such, in order to perform such measurements, tools like ramlat and tinymembench need to proceed along non-predictable walk patterns that don’t cost too much in CPU usage, so both are probably using different algorithms, resulting in different measurements.
However I’m very surprised by the comparison between the two images, because:
- tinymembench sees CPU4 twice as slow on focal
- ramlat sees CPU6 twice as slow on focal
I’m wondering if it’s not just that one cluster (or even one core) randomly gets punished during one of these tests. Maybe if you run tinymembench 5 times and ramlat 5 times, both will agree on seeing that one cluster is regularly much slower. Either the affected cluster will always be the same for a given tool, possibly providing a hint about an access pattern triggering the problem, or maybe it will happen on different clusters in a random fashion. If you look at tinymembench on cluster 0, it got one measurement significantly off at 32 MB. That’s also a hint that there could be something not very stable during these measurements.
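The suggested repeated runs could be scripted along these lines (the ramlat path matches the invocations used later in this thread; the core list assumes the RK3588’s one A55 plus two A76 cluster layout, with one representative core per cluster):

```shell
# Pin ramlat to one core per cluster and repeat 5 times, so it becomes visible
# whether the slow cluster is stable per tool or moves around randomly.
for core in 0 4 6; do
  for run in 1 2 3 4 5; do
    echo "=== cpu$core run $run ==="
    taskset -c "$core" /usr/local/src/ramspeed/ramlat -s -n 200
  done
done
```

The same loop works for tinymembench by swapping the binary, though as noted below that takes far longer per run.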
I switched to performance cpufreq governor and then gave it a try:
ramlat:
- taskset -c 0 /usr/local/src/ramspeed/ramlat -s -n 200
- taskset -c 4 /usr/local/src/ramspeed/ramlat -s -n 200
- taskset -c 6 /usr/local/src/ramspeed/ramlat -s -n 200
Consistent results with the difference starting at 4M. With Radxa’s old/original image it looks like this:
4096k: 56.92 43.23 51.38 42.95 51.56 44.68 46.86 59.27
8192k: 99.71 85.63 98.51 85.03 97.64 86.10 84.90 91.85
16384k: 119.4 119.1 120.0 109.2 118.5 109.4 109.6 107.9
And with the freshly built Armbian image (also including recent commits from Radxa’s kernel repo) it looks like this on the A76:
4096k: 100.3 75.80 87.20 74.90 86.04 75.01 80.96 111.3
8192k: 176.2 156.5 170.9 155.1 172.0 150.6 157.8 182.2
16384k: 212.5 201.0 210.3 200.5 210.7 194.6 203.9 216.8
Most probably I already found the culprit since Radxa enabled dmc on Aug 22, 2022:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# cat available_governors
dmc_ondemand userspace powersave performance simple_ondemand
So to get performance back, something like what I added years ago to a now unmaintained Armbian script might be needed…
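A minimal sketch of such a tweak (the function name is made up; the default node is the RK3588 DMC path shown above, parameterized here so the helper can be exercised against any writable directory):

```shell
# Set a devfreq governor and echo back what the kernel actually accepted.
# On a Rock 5B, as root: set_devfreq_governor performance
set_devfreq_governor() {
  gov="$1"
  node="${2:-/sys/class/devfreq/dmc}"
  echo "$gov" > "$node/governor" && cat "$node/governor"
}
```

On a real board this would typically be hooked into an init script or udev rule so the setting survives reboots.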
Now performing tinymembench runs in a similar fashion but this will take ages. I’ll update this post later.
Edit: switching back to the performance dmc governor restores initial memory latency / performance:
4096k: 56.15 43.70 50.32 43.57 50.23 44.84 47.17 60.30
8192k: 102.2 86.42 98.15 86.13 98.31 86.02 85.37 93.44
16384k: 123.1 112.3 120.5 111.8 120.5 112.8 115.3 111.7
Meanwhile over at Armbian only the usual amount of ignorance and stupidity
Edit 2: Rather pointless now that I’ve discovered the reason and provided a fix/suggestion, but here are the tinymembench runs with the dmc_ondemand governor:
- taskset -c 0 /usr/local/src/tinymembench/tinymembench (last run differs a little)
- taskset -c 4 /usr/local/src/tinymembench/tinymembench (all runs slow except the 3rd)
- taskset -c 6 /usr/local/src/tinymembench/tinymembench (all runs slow)
So the partially random behaviour as well as the poor performance are caused by the dmc governor.
While Geekbench is not a particularly good benchmark, it’s popular.
Now let’s compare Rock 5B with the dmc_ondemand and performance dmc governors:
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17009078
With dmc_ondemand we also see a roughly 10% performance drop. DRAM is clocked at just 528 MHz for the majority of the time instead of 2112 MHz:
root@rock-5b:/sys/class/devfreq/dmc# cat governor
dmc_ondemand
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 19:39:41 up 1 min, 1 user, load average: 0.23, 0.11, 0.04, cpu: 4%
Too busy for benchmarking: 19:39:46 up 1 min, 1 user, load average: 0.21, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:51 up 1 min, 1 user, load average: 0.20, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:39:56 up 1 min, 1 user, load average: 0.18, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:01 up 1 min, 1 user, load average: 0.16, 0.10, 0.04, cpu: 0%
Too busy for benchmarking: 19:40:06 up 1 min, 1 user, load average: 0.15, 0.10, 0.04, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (22 minutes elapsed).
First run:
Single-Core Score 586
Crypto Score 790
Integer Score 573
Floating Point Score 580
Multi-Core Score 2480
Crypto Score 3426
Integer Score 2364
Floating Point Score 2575
Second run:
Single-Core Score 585
Crypto Score 757
Integer Score 575
Floating Point Score 578
Multi-Core Score 2458
Crypto Score 3408
Integer Score 2322
Floating Point Score 2593
https://browser.geekbench.com/v5/cpu/compare/17008686?baseline=17008732
Full results uploaded to http://ix.io/49o7.
root@rock-5b:/sys/class/devfreq/dmc# cat trans_stat
     From   :   To
            :  528000000 1068000000 1560000000 2112000000   time(ms)
*  528000000:          0          0          0        174   1336606
  1068000000:         83          0          0         88     42146
  1560000000:         31         39          0         17     13970
  2112000000:         61        132         87          0     70973
Total transition : 712
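Those time(ms) figures mean the DMC sat at 528 MHz for roughly 91% of the run. A small helper to compute that from trans_stat (the function name and output format are my own; it only assumes the last column of each frequency row is time in ms):

```shell
# Read trans_stat-style rows from stdin and print per-frequency residency.
dmc_residency() {
  awk '/^ *\*? *[0-9]+:/ {
         gsub(/\*/, "")          # drop the current-frequency marker
         sub(/:/, "", $1)        # strip the colon from the frequency field
         time[$1] = $NF          # last column: time spent at this freq (ms)
         total += $NF
       }
       END { for (f in time) printf "%s %.1f%%\n", f, 100 * time[f] / total }'
}

# e.g.: dmc_residency < /sys/class/devfreq/dmc/trans_stat
```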
root@rock-5b:/sys/class/devfreq/dmc# echo performance >governor
root@rock-5b:/sys/class/devfreq/dmc# sbc-bench.sh -G
Average load and/or CPU utilization too high (too much background activity). Waiting...
Too busy for benchmarking: 20:03:19 up 24 min, 2 users, load average: 1.02, 2.52, 1.90, cpu: 46%
Too busy for benchmarking: 20:03:24 up 24 min, 2 users, load average: 0.94, 2.47, 1.89, cpu: 0%
Too busy for benchmarking: 20:03:29 up 25 min, 2 users, load average: 0.86, 2.43, 1.88, cpu: 0%
Too busy for benchmarking: 20:03:34 up 25 min, 2 users, load average: 0.79, 2.39, 1.87, cpu: 0%
Too busy for benchmarking: 20:03:39 up 25 min, 2 users, load average: 0.73, 2.35, 1.86, cpu: 0%
Too busy for benchmarking: 20:03:44 up 25 min, 2 users, load average: 0.67, 2.31, 1.85, cpu: 0%
Too busy for benchmarking: 20:03:49 up 25 min, 2 users, load average: 0.62, 2.27, 1.84, cpu: 0%
sbc-bench v0.9.8 taking care of Geekbench
Installing needed tools: Done.
Checking cpufreq OPP. Done.
Executing RAM latency tester. Done.
Executing Geekbench. Done.
Checking cpufreq OPP. Done (20 minutes elapsed).
First run:
Single-Core Score 669
Crypto Score 849
Integer Score 650
Floating Point Score 681
Multi-Core Score 2690
Crypto Score 3414
Integer Score 2612
Floating Point Score 2737
Second run:
Single-Core Score 669
Crypto Score 844
Integer Score 651
Floating Point Score 680
Multi-Core Score 2665
Crypto Score 3419
Integer Score 2574
Floating Point Score 2735
https://browser.geekbench.com/v5/cpu/compare/17009035?baseline=17009078
Full results uploaded to http://ix.io/49od.
And a final one. This is two times Geekbench with the powersave dmc governor: https://browser.geekbench.com/v5/cpu/compare/17009643?baseline=17009700
Results variation is much better than with dmc_ondemand.
And this is performance and powersave compared (or in other words: clocking DRAM at 2112 MHz vs. 528 MHz all the time): https://browser.geekbench.com/v5/cpu/compare/17009700?baseline=17009078
Provides some good insights about what Geekbench is actually doing, and also how dmc_ondemand, at least with multi-threaded workloads, often results in DRAM being clocked high when needed.
For Radxa (no idea whether they’re following this thread or just having a party in a hidden Discord channel) the results should be obvious, especially when sending out review samples to those clueless YouTube clowns who will otherwise share performance numbers at least 10% below what the RK3588 is able to deliver.
Another interesting data point would be the consumption difference between performance and the other dmc governors. Funnily enough I have both the equipment and the knowledge to do this
Nice, at least now there’s an explanation. I remember having searched for a long time on my first 3399. A53s could have faster DRAM access than A72s, but only if two of them were used simultaneously, otherwise it was much slower - similarly due to an ondemand DMC governor that was too conservative. Switching it to performance only marginally increased power consumption, maybe on the order of 100 mW or so, nothing that would justify the conservative setting on anything not battery-powered.
Could also be important for those guys working on GPU/VPU like @avaf or @icecream95 - when trying to draw conclusions about the RK3588, the DRAM clock could be important. So I hope they’re aware of what’s been happening below /sys/class/devfreq/dmc for the past 2 weeks.
Meanwhile let’s answer questions of the clueless Armbian crowd: dark beer is standard and the best I’ve ever tasted was in Ljubljana 2 decades ago when I worked there from time to time. If you’re in Bavaria try at least Reutberger and Mooser Liesl.
There are some parameters here /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu too.
Thank you so much for your very interesting feedback! It should be considered - we should use the good parameters for everyone.
I was planning to monitor this and the NPU.
Maybe create a very simple monitor (collect relevant info) and draw a chart as a PNG at the final step - that’s the idea.
@tkaiser, I am following up on your findings. If I come up with something useful I’ll push it to GitHub.
I think @icecream95 started to reverse engineer the NPU; maybe he can disclose some additional info about his findings.
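The monitor idea could start as small as this (everything here is a sketch: the sampler name is made up, the node paths are the ones mentioned in this thread, and the chart/PNG step is left out):

```shell
# Emit one CSV line (epoch,node,cur_freq) per devfreq node passed in. Run it
# in a loop with sleep to collect a time series for later charting.
sample_devfreq() {
  for node in "$@"; do
    printf '%s,%s,%s\n' "$(date +%s)" "$node" "$(cat "$node/cur_freq")"
  done
}

# e.g.: while true; do
#         sample_devfreq /sys/class/devfreq/dmc \
#           /sys/devices/platform/fb000000.gpu/devfreq/fb000000.gpu
#         sleep 1
#       done >> devfreq.csv
```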