I’ve conducted some comparative tests to measure the effect of the different DRAM generations. For this I’ve run llama.cpp on the Rock5B, the Rock5 ITX (under roobi), and ADLINK’s AADK based on an Ampere Altra Q80-26 (80 cores at 2.6 GHz). The Altra uses Neoverse-N1 cores, which are essentially the same as the A76. LLMs are interesting because they’re often limited by memory bandwidth during generation. Since my Rock5B has 4GB RAM, I’ve used the Phi-3-mini model (3.8B parameters) quantized at Q6_K (3.1 GB) and a small context of 512 tokens. I’m only using the big cores for this test.
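For context, here is the back-of-envelope model behind the “bandwidth-bound” claim, as a minimal Python sketch. It assumes each generated token streams the full 3.1 GB of weights once, and the bus widths (a single 64-bit interface on both Rock5 boards, 6 x 64-bit on the Altra) are my assumptions; these are theoretical ceilings, and the measured numbers below land well under them.

# Back-of-envelope ceiling on generation speed: every generated token
# has to stream the full quantized weight set from DRAM, so peak DRAM
# bandwidth caps tokens/s at roughly bandwidth / model size.
MODEL_GB = 3.1  # Phi-3-mini Q6_K

# (channels, bus width in bits, MT/s) -- the bus widths are my assumptions
boards = {
    "Rock5B    (LPDDR4X-4224)": (1, 64, 4224),
    "Rock5 ITX (LPDDR5-5472)":  (1, 64, 5472),
    "Altra     (6x DDR4-2933)": (6, 64, 2933),
}

for name, (ch, bits, mts) in boards.items():
    gbps = ch * bits / 8 * mts / 1000  # theoretical peak bandwidth in GB/s
    print(f"{name}: {gbps:6.1f} GB/s -> at most {gbps / MODEL_GB:4.1f} tok/s")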
- The Rock5B has its two big clusters running at 2256 and 2272 MHz respectively (hence 2264 avg). It uses LPDDR4X at 4224 MT/s. It parses at 5.58 tokens/s and produces 4.63 tokens/s.
- The Rock5 ITX has its two big clusters at 2287 and 2223 MHz respectively (2255 avg), and LPDDR5 at 5472 MT/s. It parses at 5.71 tokens/s and produces 4.85 tokens/s, hence 4.7% faster generation for 0.4% lower CPU frequency.
- The Rock5 ITX with only 3 threads instead of 4 drops to 3.80 t/s generation, above the 3.64 t/s that pure CPU scaling would predict (3/4 of 4.85), proving that DRAM bandwidth is already a limiting factor at 4 threads (see the sanity check after this list).
- The Altra limited to 4 threads has its cores running at 2600 MHz and 6 single-DIMM 64-bit DDR4 channels at 2933 MT/s. It parses at 7.19 tokens/s and produces 5.96 tokens/s. Hence it’s respectively 28.8 and 28.7% faster than the 5B for 14.8% higher CPU frequency and 4.17x higher memory bandwidth.
- The Altra’s generation speed peaks around 40 threads, at 22.80 tokens/s, i.e. 3.8 times faster than with 4 threads, 4.92 times faster than the Rock5B and 4.7 times faster than the Rock5 ITX.
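The ratios quoted in this list are easy to re-derive from the raw tokens/s figures; here is a quick sanity-check sketch in Python (measured numbers only, nothing new):

# Re-deriving the comparative figures from the measured generation rates.
rock5b_gen, itx_gen, altra4_gen, altra40_gen = 4.63, 4.85, 5.96, 22.80

# Pure CPU scaling from 4 threads down to 3 would predict 3/4 of the rate;
# the measured 3.80 t/s beats this, so DRAM was already limiting at 4 threads.
print(f"CPU-bound prediction at 3 threads: {itx_gen * 3 / 4:.2f} t/s (measured: 3.80)")

print(f"ITX vs Rock5B generation: +{(itx_gen / rock5b_gen - 1) * 100:.2f}%")
print(f"Altra@40t vs Altra@4t:    x{altra40_gen / altra4_gen:.2f}")
print(f"Altra@40t vs Rock5B:      x{altra40_gen / rock5b_gen:.2f}")
print(f"Altra@40t vs Rock5 ITX:   x{altra40_gen / itx_gen:.2f}")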
This suggests that 6 channels of DDR4-2933 impose a hard ceiling of 22.8 tok/s, which in turn puts an upper bound of 5.47 t/s on the Rock5B’s 4224 MT/s RAM and 7.09 t/s on the Rock5 ITX’s 5472 MT/s. Of course the CPUs are also a limiting factor here, but we’ve shown above that DRAM matters in the 4-thread test. A quick ratio calculation, 4.85/22.8*6*2933 = 3743, shows that the Rock5 ITX delivers as if it were running DDR4 at 3743 MT/s (1872 MHz), and the Rock5B as if it were at just about 1800 MHz, i.e. roughly DDR4-3600 (but again the CPU does count here).
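The same calculation as a small Python sketch, using the Altra’s 22.8 t/s as the reference point for 6 channels of DDR4-2933 and assuming generation speed scales linearly with aggregate MT/s (the exact Rock5B figure comes out at DDR4-3574 / 1787 MHz before rounding):

# Equivalent single-channel DDR4 speed implied by the measured rates, using
# the Altra (6 channels of DDR4-2933 -> 22.8 t/s) as the reference and
# assuming tokens/s scales linearly with aggregate MT/s.
ALTRA_TPS, ALTRA_CHANNELS, ALTRA_MTS = 22.8, 6, 2933

def equivalent_ddr4(tps):
    mts = tps / ALTRA_TPS * ALTRA_CHANNELS * ALTRA_MTS
    return mts, mts / 2  # MT/s, and the corresponding I/O clock in MHz

for board, tps in {"Rock5 ITX": 4.85, "Rock5B": 4.63}.items():
    mts, mhz = equivalent_ddr4(tps)
    print(f"{board}: behaves like single-channel DDR4-{mts:.0f} ({mhz:.0f} MHz)")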
I would genuinely have expected a somewhat higher gain between the 5B and the ITX (maybe 10-15%), but as noted above there’s still the open question of why LPDDR5 isn’t that much faster. That said, it is slightly faster (4.7% more generation for a 0.4% slower CPU clock), just not by much.
Regardless, this remains very good performance and should be sufficient for most use cases. But anything we can find to make LPDDR5 perform significantly better than LPDDR4X would be welcome.
For those interested in reproducing these tests, I’ve used tag b2918 of llama.cpp, with Phi-3-mini-128k-instruct.Q6_K.gguf. The command line and (trimmed) output are:
willy@roobi:~/llama.cpp$ time taskset -c 4-7 ./main -c 512 -s 1 --temp 0.1 -n -1 --threads 4 -m ../models/Phi-3-mini-128k-instruct.Q6_K.gguf -e -p "<|im_start|>system\nYou're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>\n<|im_start|>user\nPlease explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant\n"
(...)
<s> <|im_start|>system
You're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>
<|im_start|>user
Please explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant
Running a Large Language Model (LLM) on a low-power processor like the Cortex A76 or Neoverse-N1 could have several potential benefits.
1. **Energy Efficiency**: Low-power processors are designed to consume less power, which can lead to significant energy savings. This is particularly beneficial in large-scale deployments where energy consumption can be a major concern.
2. **Cost Savings**: Lower power consumption translates to lower energy costs. This can result in significant cost savings, especially in large-scale deployments.
3. **Environmental Impact**: Lower energy consumption also means a reduced environmental impact. This is particularly important in the context of climate change and the global effort to reduce carbon emissions.
4. **Heat Generation**: Lower power processors generate less heat. This can reduce the need for cooling systems, which can further reduce energy consumption and costs.
5. **Performance**: While it's important to note that low-power processors may not offer the same level of performance as high-power processors, they can still provide adequate performance for many applications.
In conclusion, running an LLM on a low-power processor can offer benefits in terms of energy efficiency, cost savings, environmental impact, heat generation, and adequate performance. However, the specific benefits would depend on the specific requirements and constraints of the application.<|endoftext|> [end of text]
(...)
system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 842.03 ms
llama_print_timings: sample time = 17.60 ms / 342 runs ( 0.05 ms per token, 19431.82 tokens per second)
llama_print_timings: prompt eval time = 18920.53 ms / 108 tokens ( 175.19 ms per token, 5.71 tokens per second)
llama_print_timings: eval time = 70334.79 ms / 341 runs ( 206.26 ms per token, 4.85 tokens per second)
llama_print_timings: total time = 89350.71 ms / 449 tokens