I agree and that’s exactly my point. That’s also why I asked FriendlyELEC to install the SPI NOR on their latest NanoPC-T6 and to flash it (though I don’t think this last point has been done yet). As a PC user I don’t like UEFI because it’s more complicated than the legacy mode without adding visible value. But for ARM it’s the least painful solution that works for everyone out of the box, and the code exists for RK3588.
BTW, ps:ps worked fine on the login prompt (with some script syntax errors but that’s all):
roobi login: ps
Password:
Linux roobi 5.10.110-33-rockchip #65700d485 SMP Wed Apr 3 04:26:57 UTC 2024 aarch64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri May 17 19:29:29 UTC 2024 on tty1
-bash: [: : integer expression expected
-bash: [: : integer expression expected
-bash: [: : integer expression expected
ps@roobi:~$
I also found that the board indeed got a DHCP address, and pointing a browser at it indeed shows the roobi interface. Another point that is not great for an on-the-table setup is this:
Due to remote access to this device, authentication is required. After clicking the start button,
please press the power button three times within 60 seconds to complete the verification.
/me using the screwdriver again to short pins.
It then allows you to choose an OS image among two, and claims “the medium is not detected” (even when I click on the refresh circle on the right). This is exactly the reason I absolutely hate and despise such proprietary installation tools. Nobody uses them, so they are extremely lightly tested and full of bugs and limitations that make them a total pain to deal with.
Bah, for now I’ll run my tests from this image. It’s already a full-fledged OS, even gcc is installed on it! I don’t know what the intent was for a pre-installation image but it will be useful anyway
Is there an installation medium (other than eMMC which should not be user-accessible on Rock 5 ITX by default anyway since it’s Roobi’s place to live)?
No, I have not yet added any other one, I naturally expected such a board to have its data on either SATA or M2 and the OS on the eMMC, I mean, what anyone would likely expect on a server board! Sacrificing the eMMC for a boot loader is nonsense, particularly when that means that you’ll either need to move your OS to the data disks or dedicate one drive for the OS. I really don’t understand such baroque choices. If the goal is to make it as irritating as using a Synology NAS I can understand, but I’m not seeing how one sees value in doing that :-/
I’m starting to get a feeling that the board is really awesomely designed from a hardware perspective, maybe one of the best products from Radxa to date, and that some incomprehensible software choices are going to hinder its adoption as word spreads about how painful it is to set up and use :-/ I’m a bit confused.
Also, there’s a micro-SD slot on the board that supports UHS mode and reads at 88MB/s from a 64G SD I have here. If there’s one place where the Roobi gadget should be placed, it’s exactly on a micro-SD. Everyone will easily find a spare one, download the Roobi image, place it on the SD and install the OS from there, then eject the SD once finished. No need to sacrifice the eMMC for that single-use stuff. And I would much rather install the OS on an eMMC than on a micro-SD.
Nope, users fail to select the right OS image (please note: there’s not a single Roobi OS installer; the images are device-specific since the SBC vendor doesn’t ship boards with SPI NOR preflashed with something that would allow booting generic aarch64 images).
And (even most experienced) users especially with Radxa are not even able to find the images for a specific device.
As for the target audience: Radxa advertises this thing as ‘ARM PC’ and desktop users want their OS on a NVMe SSD since they have heard that MB/s are that great and that important (nobody in that world understands the difference between random and sequential I/O). The average desktop user in front of such an ‘ARM PC’ will never notice there’s eMMC inside since it only contains some ‘firmware’ allowing the OS image of choice to be flashed to an installed SSD in the M.2 slot.
For desktop I agree that you want to install the OS on M.2. For a server with no other PCIe slot you’ll want to use M.2 to connect PCIe devices (extra SATA slots or network cards).
And it doesn’t change the fact that Radxa could provide the roobi SD image for this board as they used to do in the past for their other OS images. Again, I’m fine with providing a user-friendly installer, but not at the cost of removing the only viable OS storage of the board. Right now I’m running off this distro which is going to be the main OS of this platform. This looks like a particularly weird and ridiculous situation where every storage device was shifted one place:
- SPI NOR: empty, not used
- eMMC: should contain the main OS, instead contains u-boot and this roobi gadget that takes all the room left (7G partition dedicated to this!!!)
- M2: no longer usable for PCIe / SATA if you’re forced to move your OS to an NVME SSD due to this roobi you’ll never ever use again that steals your eMMC.
- SD: not used during the install process
- USB: not used during the install process
A correct setup would be:
- SPI NOR: boot loader (u-boot, edk2, whatever etc)
- eMMC: main OS. Optionally reserve a few MB for a recovery OS if the NOR is too tight.
- SATA: data storage
- M.2: either main data storage (SSD), complementary data storage (SATA), network, other?
- SD/USB: usable during boot to plug either a standard installation image for the main OS (ubuntu, freebsd, maybe even windows, I don’t know), and usable as well for reinstallation by using the user-friendly roobi installer.
And what’s sad is that everything is present and properly wired on the product for this, it’s just totally mixed up at the software level!
I’ve conducted some comparative tests to measure the effect of the different DRAM generations. For this I’ve run llama.cpp on the Rock5B, the Rock5 ITX (under roobi), and ADLINK’s AADK based on an Ampere Altra Q80-26 (80 cores at 2.6 GHz). The Altra uses Neoverse-N1 but it’s essentially the same core as the A76. LLMs are interesting because they’re often limited by the memory bandwidth during generation. Since my Rock5B has 4GB RAM, I’ve used the Phi3-3B model quantized at Q6_K (3.1 GB) and a small context of 512 tokens. I’m only using the big cores for this test.
- The Rock5B has its two big clusters running at 2256 and 2272 MHz respectively (hence 2264 avg). It uses LPDDR4x at 4224 MT/s. It parses at 5.58 tokens/s and produces 4.63 tokens/s.
- The Rock5 ITX has its two big clusters at 2287 and 2223 MHz respectively (2255 avg), and LPDDR5 at 5472 MT/s. It parses at 5.71 tokens/s and produces 4.85 tokens/s, hence 4.7% faster generation for 0.4% lower CPU frequency.
- The Rock5 ITX with only 3 threads instead of 4 drops to 3.80 t/s generation, above the theoretical 3.64 if we were CPU-bound, proving that the DRAM B/W is already the limiting factor when running under 4 threads.
- The Altra limited to 4 threads has its cores running at 2600 MHz and 6 single-DIMM 64-bit DDR4 channels at 2933 MT/s. It parses at 7.19 tokens/s and produces 5.96 tokens/s. Hence it’s respectively 28.8 and 28.7% faster than the 5B for 14.8% higher CPU frequency and 4.17x higher memory bandwidth.
- The highest generation speed the Altra reaches is around 40 threads at 22.80 tokens/s, or 3.8 times faster than with 4 threads, 4.92 times faster than Rock 5B or 4.7 times faster than Rock 5 ITX.
This shows that 6*2933 MT/s has a hard upper bound of 22.8 tok/s. This fixes an upper bound of 5.47 t/s for the Rock5B’s 4224 MT/s RAM and 7.09 t/s for the Rock5 ITX’s 5472 MT/s. Of course the CPUs also play a limiting factor here, but we’ve shown above that DRAM counts for the 4-thread test. If we perform a quick ratio calculation, 4.85/22.8*6*2933 shows that the Rock5-ITX delivers as if it was running DDR4 RAM at 3743 MT/s or 1872 MHz, and Rock5B as if it was at just 1800 MHz or DDR4-3600 (but again CPU does count here).
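For anyone who wants to redo the ratio calculation above, here is a minimal Python sketch. It follows the same simplification as the post: the boards' MT/s figures are compared directly with the Altra's 6*2933 MT/s, and generation speed is assumed to scale linearly with DRAM transfer rate. The function names are only illustrative.

```python
# Peak generation speed measured on the Altra (around 40 threads), in tokens/s.
ALTRA_PEAK_TPS = 22.8
# Altra memory: 6 DDR4 channels at 2933 MT/s each.
ALTRA_MTS = 6 * 2933


def bandwidth_bound_tps(board_mts):
    """Upper bound in tokens/s for a board, scaled from the Altra's peak.

    Assumption (same as in the post): tokens/s is strictly proportional
    to the aggregate transfer rate in MT/s.
    """
    return ALTRA_PEAK_TPS * board_mts / ALTRA_MTS


def equivalent_ddr4_mts(measured_tps):
    """DDR4 transfer rate that would yield measured_tps under the same rule."""
    return measured_tps / ALTRA_PEAK_TPS * ALTRA_MTS


if __name__ == "__main__":
    # (name, RAM speed in MT/s, measured 4-thread generation speed in tokens/s)
    boards = (("Rock5B", 4224, 4.63), ("Rock5 ITX", 5472, 4.85))
    for name, mts, tps in boards:
        print(f"{name}: bound {bandwidth_bound_tps(mts):.2f} t/s, "
              f"behaves like DDR4-{equivalent_ddr4_mts(tps):.0f}")
```

Since the CPU also limits the 4-thread runs, the "equivalent DDR4" figures are better read as a rough indicator than as a measurement of the memory controller.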
I would genuinely have expected a slightly higher gain between 5B and ITX (maybe 10-15%), but I’ve read above that there’s still this pending question about why LPDDR5 is not that much faster. With that said, it still is slightly faster (4.7% for a 0.4% slower CPU) but not by much.
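The percentages used in this comparison can be rechecked with a few lines of Python. This is just a sketch using the frequencies and tokens/s quoted in the list above; the variable names are made up for the example, and results may differ from the quoted figures in the last digit because of rounding.

```python
def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0


# Average big-core frequency (MHz) and generation speed (tokens/s) from the post.
rock5b = {"mhz": (2256 + 2272) / 2, "gen_tps": 4.63}
rock5_itx = {"mhz": (2287 + 2223) / 2, "gen_tps": 4.85}

# Generation gain of the ITX (LPDDR5) over the 5B (LPDDR4x).
gen_gain = pct_change(rock5_itx["gen_tps"], rock5b["gen_tps"])
# CPU frequency difference between the two boards (negative: ITX is slower).
freq_change = pct_change(rock5_itx["mhz"], rock5b["mhz"])

# If generation were purely CPU-bound, 3 threads would deliver 3/4 of the
# 4-thread speed. The measured 3.80 t/s sits above this estimate.
cpu_bound_3_threads = rock5_itx["gen_tps"] * 3 / 4

if __name__ == "__main__":
    print(f"generation gain: {gen_gain:.2f}%")
    print(f"frequency change: {freq_change:.2f}%")
    print(f"CPU-bound estimate for 3 threads: {cpu_bound_3_threads:.2f} t/s")
```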
Regardless, that remains very good performance and it should be sufficient for most use cases. But anything we find to make LPDDR5 perform significantly better than LPDDR4X would be welcome I guess.
For those interested in reproducing these tests, I’ve used tag b2918 of llama.cpp, with Phi-3-mini-128k-instruct.Q6_K.gguf. The command line and (trimmed) output are:
willy@roobi:~/llama.cpp$ time taskset -c 4-7 ./main -c 512 -s 1 --temp 0.1 -n -1 --threads 4 -m ../models/Phi-3-mini-128k-instruct.Q6_K.gguf -e -p "<|im_start|>system\nYou're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>\n<|im_start|>user\nPlease explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant\n"
(...)
<s> <|im_start|>system
You're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>
<|im_start|>user
Please explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant
Running a Large Language Model (LLM) on a low-power processor like the Cortex A76 or Neoverse-N1 could have several potential benefits.
1. **Energy Efficiency**: Low-power processors are designed to consume less power, which can lead to significant energy savings. This is particularly beneficial in large-scale deployments where energy consumption can be a major concern.
2. **Cost Savings**: Lower power consumption translates to lower energy costs. This can result in significant cost savings, especially in large-scale deployments.
3. **Environmental Impact**: Lower energy consumption also means a reduced environmental impact. This is particularly important in the context of climate change and the global effort to reduce carbon emissions.
4. **Heat Generation**: Lower power processors generate less heat. This can reduce the need for cooling systems, which can further reduce energy consumption and costs.
5. **Performance**: While it's important to note that low-power processors may not offer the same level of performance as high-power processors, they can still provide adequate performance for many applications.
In conclusion, running an LLM on a low-power processor can offer benefits in terms of energy efficiency, cost savings, environmental impact, heat generation, and adequate performance. However, the specific benefits would depend on the specific requirements and constraints of the application.<|endoftext|> [end of text]
(...)
system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 842.03 ms
llama_print_timings: sample time = 17.60 ms / 342 runs ( 0.05 ms per token, 19431.82 tokens per second)
llama_print_timings: prompt eval time = 18920.53 ms / 108 tokens ( 175.19 ms per token, 5.71 tokens per second)
llama_print_timings: eval time = 70334.79 ms / 341 runs ( 206.26 ms per token, 4.85 tokens per second)
llama_print_timings: total time = 89350.71 ms / 449 tokens
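If you collect several of these runs, a small script can pull the tokens/s figures out of the `llama_print_timings` lines instead of reading them by eye. The sketch below is only an illustration based on the output format shown above (other llama.cpp versions may print it differently), and the function name is hypothetical.

```python
import re

# Captures the timing name and the "tokens per second" figure of one line.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s*=.*?"
    r"(?P<tps>[\d.]+) tokens per second"
)


def tokens_per_second(log_text):
    """Return a dict such as {"prompt eval": 5.71, "eval": 4.85} in tokens/s.

    Lines without a tokens-per-second figure (load time, total time) are skipped.
    """
    rates = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            rates[match.group("name").strip()] = float(match.group("tps"))
    return rates


# Timing lines copied from the Rock5 ITX run above.
SAMPLE_LOG = """\
llama_print_timings: load time = 842.03 ms
llama_print_timings: sample time = 17.60 ms / 342 runs ( 0.05 ms per token, 19431.82 tokens per second)
llama_print_timings: prompt eval time = 18920.53 ms / 108 tokens ( 175.19 ms per token, 5.71 tokens per second)
llama_print_timings: eval time = 70334.79 ms / 341 runs ( 206.26 ms per token, 4.85 tokens per second)
llama_print_timings: total time = 89350.71 ms / 449 tokens
"""

if __name__ == "__main__":
    for name, tps in tokens_per_second(SAMPLE_LOG).items():
        print(f"{name}: {tps} tokens/s")
```

Here "prompt eval" is the parsing speed and "eval" the generation speed referred to in the comparison.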
And for those wondering how x86 would behave here, the N5105 in my Odroid-H3 with its 4 cores at 2.8 GHz has two DDR4-3200 DIMMs. It delivers only 1.85 tokens/s on this test, or only 38% of the Rock5-ITX! In this case it’s likely that the lack of AVX counts. But given that newer CPUs such as N100 have cut their DRAM bandwidth in half to segment their market, they won’t do much better than the Rock5 here anyway. This, I think, makes RK3588 closer to recent x86 chips which have been purposely castrated, and it means that Rock5-ITX very likely has some chances to be compared to other PC motherboards for various setups.
Here are the raw numbers from this test:
system_info: n_threads = 4 / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
(...)
llama_print_timings: load time = 1400.98 ms
llama_print_timings: sample time = 15.33 ms / 309 runs ( 0.05 ms per token, 20161.82 tokens per second)
llama_print_timings: prompt eval time = 49784.80 ms / 108 tokens ( 460.97 ms per token, 2.17 tokens per second)
llama_print_timings: eval time = 166469.62 ms / 308 runs ( 540.49 ms per token, 1.85 tokens per second)
llama_print_timings: total time = 216500.45 ms / 416 tokens
Can you elaborate on that please? I know there’s this single vs. dual channel ‘issue’ but according to tinymembench scores your N5105 is outperformed by an N100: https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md
Bah never mind, I wrongly used the term “bandwidth” instead of the term “bus width”. BTW N100 is faster but not by a large margin, and having one channel will not let it issue two parallel fetches at once. It’s also interesting to see that some of the rock5 are faster than the N100 and even up to 2.5 times faster for memset.
I made some power measurements in idle (the use case that corresponds the most to a home server or NAS). For this, I installed the board in an APLUS mini-itx CS-CUPID 2 enclosure which contains its own power adapter board (12V to ATX connector). I’ve equipped the system with 4 SSDs, 2*intel X25M 160GB, 2*intel 530 180GB, and a 10GbE NIC (the one I mentioned here). I also tested an aliexpress 12V jack-to-ATX “160W” adapter. I measured the voltage and current at the connector. Here are my measurements:
- 12V via the aliexpress ATX 160W adapter: 1.06A * 11.83V = 12.54W
- 12V via the enclosure’s adapter board: 1.16A * 11.95V = 13.86W
- 12V via the motherboard’s jack, ATX adapter still connected: 1.15A * 11.95V = 13.74W
- 12V via the motherboard’s jack, ATX unplugged: 1.02A * 11.96V = 12.20W
Thus the board’s power design looks extremely efficient, beating the other two. There’s 1.7W saved here by powering the board via its own jack instead of the enclosure’s adapter. Pretty good!
In addition I measured the individual power draw of various components, all powered from the motherboard’s jack:
- removed all SSDs: 0.81A * 12.03V = 9.74W => 2.45W drawn by the 4 SSDs in idle
- no SSD and 10GbE link down: 0.59A * 12.14V = 7.16W => the 10GbE RJ45 link draws 2.58W alone.
- no SSD, 10GbE adapter removed: 0.40A * 12.21V = 4.88W => the 10GbE adapter draws 2.28W with a link down, and 4.86W with a link up
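The per-component figures above come from subtracting successive measurements, which can be sketched in Python as follows. The current and voltage readings are the ones from the post (starting from the jack-powered configuration with the ATX adapter unplugged); the helper names are just for illustration, and a delta may differ by 0.01W from the quoted value depending on when rounding is applied.

```python
# (configuration, current in A, voltage in V), all powered from the board's jack.
STEPS = [
    ("full system, ATX adapter unplugged", 1.02, 11.96),
    ("all SSDs removed", 0.81, 12.03),
    ("no SSD, 10GbE link down", 0.59, 12.14),
    ("no SSD, 10GbE adapter removed", 0.40, 12.21),
]


def watts(amps, volts):
    """DC power in watts from current and voltage."""
    return amps * volts


def breakdown(steps):
    """Return (label, power_w, drop_vs_previous_w) tuples.

    The drop is what the component removed at that step was drawing;
    it is None for the first entry, which has no predecessor.
    """
    result = []
    previous = None
    for label, amps, volts in steps:
        power = round(watts(amps, volts), 2)
        drop = None if previous is None else round(previous - power, 2)
        result.append((label, power, drop))
        previous = power
    return result


if __name__ == "__main__":
    for label, power, drop in breakdown(STEPS):
        suffix = "" if drop is None else f" (previous step minus {drop}W)"
        print(f"{label}: {power}W{suffix}")
```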
Regardless, that does make an excellent 10GbE NAS: there are not that many options out there for 10G, fanless at 12W! I’m going to write more about the setup and share photos later. The SSDs are dated but sufficient to read at 10+Gbps (I think in fact we’re not far from saturating the PCIe x2 of the SATA controller).
I finally published a test covering the installation into an enclosure, quick performance testing at 10GbE and how to install the OS on the eMMC here: https://wtarreau.blogspot.com/2024/05/an-affordable-10gbe-capable-nas.html . Now’s time to have some sleep
BTW I’m wondering if the “LPDDR5 2400 MHz” message I’ve captured at boot could have anything to do with the comments about the RAM not being as fast as hoped, or if it’s only during the early boot that it’s like this and changes later.
Thanks for your detailed blog post. It’s an interesting read, and all the details about power consumption are extremely useful
I’ll study that later tonight. I’m going on the same journey with the ROCK 5B, and hopefully I will have some time to describe my build just like you did
ROCK 5B is an awesome board as well. Many times I wanted to make an enclosure for it and turn it into yet-another server, but I figured I would miss this nice dev board, so I didn’t do it
The 5B with a bifurcation adapter should be OK. I was thinking about the 5B+, but having roobi in place of the eMMC just killed this idea for me: all PCIe lanes are already used for something else, and a NAS is the build where you need those more than anywhere else.
BTW: I’ll be using the same SSD drives (480GB variant). I hope that they offer more capacity for the same power draw. Have you tried to open their covers to see the PCB inside? I’m curious whether there are fewer memory chips.
I got this adapter and it works fine: https://www.amazon.com/dp/B0CPVRGV5D?psc=1&ref=ppx_yo2ov_dt_b_product_details
I noticed something else that could be improved: the CR1220 battery is very hard to source, with more than 2 months shipping time everywhere I find one. There’s some room near the battery holder, a two-pin connector could have been nice there to connect an external battery, as many CR2016/2032 batteries exist with a tiny 2-pin connector (I don’t remember which one but I have plenty, it looks like a 1.27 or 1.5mm pitch).
For those who, like me, are having difficulties finding CR1220 batteries, I finally managed to find a CR1216 one in a DIY store here, and I’m glad to say that it fits well. I was planning on trying to make a battery-sized adapter made of PCB but found that battery just before doing it
Two small suggestions for improving the ROCK 5 ITX:
- place a small battery header close to the RTC battery holder, or use a vertical holder for the ubiquitous CR2032. All those I saw were PTH not SMD but I don’t think it would be that big of a deal to drill two extra holes for such a holder.
- add a small header for the maskrom button to help with alternate booting. I would even place a 20k resistor in series with it so that by default, shorting it would force the boot from the SD card. It would make it extremely convenient to test new boot loaders without having to reflash the one on the eMMC or touch the SPI, since it allows forcing a temporary boot attempt. I did it by touching the right part of the button with a multimeter probe, but exposing that facility should significantly help users test new boot loaders.
Hmmm finally a third suggestion. A mini-ITX board is usually meant to be installed in a small enclosure. We’ve seen example photos at the top where the micro-SD arrives on the edge of the case. Mine is no exception and I have to push hard on the disk cables to manage to just reach it. Other 1U enclosures with a small included PSU will also often have it there, making it hard to impossible to insert/change the SD. It looks like there’s free space on the bottom right, and enclosures are usually a bit deeper at least to leave room for cables or front drives or front panel. I think that placing the micro-SD slot under the EDP connector at the bottom of the board and facing the right side would be much more convenient, as in the suggestion below: