Question: Radxa X4 vs Rock 5B plus for LLMs

I did some research on SBCs for running LLMs. My goal is to run an LLM (with tool-calling support) and build a voice AI.
At first, while searching for SBCs, I came across the Raspberry Pi 5. But after watching some YouTube reviews I learned that the RPi 5 has no GPU that LLMs can use, so 3B models lag terribly on it. In some videos, Llama 3.2 3B only generates 1.78 tokens per second. That is not what I'm looking for.

Then I came across the name Radxa. Radxa SBCs don't have many videos on YouTube, but I watched a few and saw that the Radxa X4 has a 4 GB GPU, according to this video at 7:45: https://youtu.be/F2atAHDOaIA
That would be enough for running a 3B model, wouldn’t it?

Also, according to Radxa's docs, RK3588-based SBCs have another advantage: the Radxa Rock 5B Plus can generate 6 tokens per second on the Phi-3 3.8B model. (Docs: https://docs.radxa.com/en/rock5/rock5c/app-development/rkllm_usage)

Main Question

To sum up: the Raspberry Pi 5 is terrible for running 3B LLMs, the Radxa X4 has a 4 GB GPU which should be enough for 3B models, and the Radxa Rock 5B Plus has the RK3588 chip advantage. My question is: if I want to run a Qwen 3B model, which one should I buy, the Radxa X4 or the Rock 5B Plus?

N.B.: I don't have any CPU/GPU knowledge, so I don't know which is better. Some community support would be great 🙂

For local LLMs, the Radxa Fogwise Airbox is a better choice. The Airbox can run a 7B model at about 20 tokens per second.

Thanks for the suggestion, but that would be too much for me, in both size and price; I want something smaller. Besides, I already have Llama 70B running on a cloud server. To make the voice AI a little usable offline, I need to run Qwen 3B, which outperformed Llama 3B at tool calling. So I only need something capable of running it at a reasonable speed. I'll also consider your suggestion after doing more research. Thanks!

We are going to introduce a smaller and lower-cost SBC that can run 3B models this year. Stay tuned.

Umm… so there isn't any SBC yet that can run 3B models? I only need around 10 tokens per second 😐. Can you at least compare these two SBCs, the Radxa X4 and the Rock 5B Plus?

At 10 t/s for a Qwen2.5 3B model, the Rock 5B definitely works. Mine processes the prompt at 15 t/s and generates at 10 t/s with IQ4_XS quantization:

$ taskset -c 4-7 ./llama-cli -c 1024 -n 100 --threads 4 --model /mnt/models/Qwen2.5-3B-Instruct-IQ4_XS.gguf -p "Could you please explain to me in a few sentences what llama.cpp is ?"
Could you please explain to me in a few sentences what llama.cpp is ? llama.cpp is a C++ code library that implements the LLaMA model, which is a large-scale pre-trained language model. The code library provides functions to load, fine-tune, and use the LLaMA model for tasks such as text generation, language understanding, and conversational interactions. LLaMA is known for its high performance and accuracy, and the llama.cpp library makes it accessible for developers to incorporate into their applications. [end of text]


llama_perf_sampler_print:    sampling time =      17.26 ms /   104 runs   (    0.17 ms per token,  6024.79 tokens per second)
llama_perf_context_print:        load time =     742.52 ms
llama_perf_context_print: prompt eval time =     998.04 ms /    15 tokens (   66.54 ms per token,    15.03 tokens per second)
llama_perf_context_print:        eval time =    8631.99 ms /    88 runs   (   98.09 ms per token,    10.19 tokens per second)
llama_perf_context_print:       total time =    9668.12 ms /   103 tokens

You can even use the Q4_0_4_4 quantization, which doubles the prompt processing speed to 32 t/s (it's CPU-bound) and slightly improves text generation to 10.71 t/s.
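
If you want to try that, the repacked file can be produced with llama-quantize, which is built next to llama-cli (the F16 source file name here is just an example, and newer llama.cpp builds have dropped the Q4_0_4_4 type in favour of runtime repacking of Q4_0):

$ ./llama-quantize /mnt/models/Qwen2.5-3B-Instruct-F16.gguf /mnt/models/Qwen2.5-3B-Instruct-Q4_0_4_4.gguf Q4_0_4_4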

I could also run the 7B model quantized at 3 bits (I only have 4 GB), but it's super slow (2.2 t/s prompt, 1.84 t/s gen). 3B is definitely the size to go for on a Rock 5, IMHO.

BTW, running the same models on the Orion O6 gives 41.83 and 21.44 t/s (prompt and generation) for the 3B model, and 29.12 and 9.09 t/s for the 7B model, respectively!

Thanks a lot for confirming it. I'll put the Rock 5B+ on my shortlist, but I'm still waiting for someone to dig up and post results for the Radxa X4.

Anyway, may I ask an off-topic question? Are you using an M.2 SSD with it? If so, can you tell me the model name of the SSD that works? Just a request from a newbie over here 🫠

Yes, I'm using an M.2 SSD. It's a cheap "KingSpec" 250 GB one I bought on AliExpress just to store LLMs. I have the same (albeit 128 GB) on my Orion O6. Both can reach 1 GB/s on their respective SBCs. They look better than I imagined when I ordered them :wink:
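
If you want to sanity-check the throughput on yours, a quick buffered-read test is enough for that kind of ballpark figure (the device name will differ on your board):

$ sudo hdparm -t /dev/nvme0n1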

Thanks again! I will buy a whole Rock 5B Plus kit. I've confirmed that the 5B Plus is way better than the X4 for LLMs: the X4 has only 4 cores, while the 5B Plus has 8 cores. The Rock 5B Plus also has an NPU advantage, according to this video: https://youtu.be/KB9qRwj1pnU

Note that here I’m only using the 4 big cores, since the 4 little ones are pointless for an LLM and slow it down. Also I haven’t used the NPU as I think it’s still unusable for llama.cpp. While it could possibly improve prompt processing, the generation still remains memory-bound anyway. Last point, I’m using 5B, not 5B+. 5B is using LPDDR4X at 4224 MT/s while 5B+ uses LPDDR5-4800. In theory 5B+ should be slightly faster. In practice I found the 5ITX (which also uses LPDDR5-4800) to be slightly slower but the difference is negligible.

The cores on the N100 are faster than the A76 cores. Also, there is no "NPU advantage", because you have to use either the CPU or the NPU, not both. Another thing: for now it's quite difficult to use the Intel integrated GPU to run an LLM, so you'll probably be using the CPU. If you could use it, though, the GPU would probably be quite fast.
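
That said, llama.cpp does ship a Vulkan backend that can in principle target Intel iGPUs, so something along these lines could be worth an experiment (I haven't tried it on an X4 myself; the model path is a placeholder):

$ cmake -B build -DGGML_VULKAN=ON
$ cmake --build build --config Release -j
$ ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "hello"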

@willy sorry if this is off-topic, but I am actually going insane while trying to do this.

How are you running the LLM on 4 cores? I have a working installation of dockerized Ollama and there seems to be absolutely no way to tell it to use fewer than 8 cores. I tried changing some env variables (OLLAMA_NUM_THREADS), passing some parameters to Ollama (--num_threads), and finally I tried changing the Docker "Entrypoint" to /bin/taskset and the command to something like cpuset-cpus 3-7 /bin/ollama serve, but nothing works. If there were a way to tell Docker to only expose some cores it would be great; I think that when I limit the CPUs to 4 in a docker run command, Ollama might still be detecting 8 threads and running 8 threads on 4 cores, which is super inefficient.
(note to self: taskset should be 4-7)
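
For reference, the combination I still intend to test is pinning the container to the big cores with --cpuset-cpus and forcing the thread count through Ollama's request options (qwen2.5:3b is just an example model; I haven't verified this on the Rock 5B yet):

$ docker run -d --name ollama --cpuset-cpus="4-7" -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
$ curl http://localhost:11434/api/generate -d '{"model":"qwen2.5:3b","prompt":"hello","options":{"num_thread":4}}'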

I've never used Ollama. My understanding (possibly wrong) is that it's only an eye-candy wrapper on top of llama.cpp, and I really hate wrappers for exactly the reasons you're facing: they hide the important stuff from you, and when you want to do something a bit finer, you simply can't. I'm just using regular llama.cpp, which I indeed start with taskset -c 4-7 ./build/bin/llama-cli -t 4

The advantage is that it can integrate with Open WebUI, so you get a nice web interface with conversation memory etc., and it can download models automatically. Using the terminal every time is a pain.
https://gist.github.com/qlyoung/abd217f977399003ba0cc277feca2af9 Docker is such a joke!
/end off topic

Well, unless I'm mistaken, that's simply what llama-server is (it's built along with all the other tools as part of llama.cpp) :slight_smile:

I can see some value in this, but not having infinite bandwidth or storage, I very much appreciate being able to choose the quantization level based on the file size, and to read the model cards as well.

I agree. It's where computer science stopped, and where the machines started to get the upper hand over their owners, with everyone getting used to finding this normal. Not my world (yet)…

About the CPU cores: if you use all 8 cores, won't it be faster than using only 4 cores, even if the other 4 cores are slower?

Hey incognito man (no offence), I realize now that the CPU and NPU cannot be used at the same time; that detail isn't widely documented on the internet. So I guess I'm back to square one. Anyway, do you have a Radxa X4? If so, can you run ollama run qwen2.5:3b --verbose and show us the eval rate, i.e. tokens per second? That would help me a lot.

I don't have an X4, but I can run it on an N100 mini PC (DDR5-4800, 30 W power limit, proper 64-bit DDR5). It should be a bit faster than the X4 but will give you some idea. I can also run it on an N100 with a lower power limit and DDR4 (though still not the 32-bit "half-channel" memory of the X4).
I think LLMs run as fast as the slowest processing unit, so mixing various processing methods won't help much.

It's much slower (1/4 or so). The reason is simple: llama.cpp distributes the load equally between all threads, so everything progresses at the speed of the slowest one. That's exactly the same problem as with some parallel compression programs that need rendez-vous points between all threads.

Generally speaking, users should really consider that "efficient" cores are only for tasks that are not on any critical path: they provide best-effort execution and should never be mixed with other cores for a given workload. For LLMs that's critical. In any case, llama.cpp is not limited by the CPU cores but by the memory bandwidth on such devices, so having only 4 cores instead of 8 isn't that big of a problem.
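
An easy way to see this for yourself is to compare the two configurations with llama-bench, which is built alongside the other llama.cpp tools (the model path is just the one from my earlier test):

$ taskset -c 4-7 ./llama-bench -m /mnt/models/Qwen2.5-3B-Instruct-IQ4_XS.gguf -t 4
$ taskset -c 0-7 ./llama-bench -m /mnt/models/Qwen2.5-3B-Instruct-IQ4_XS.gguf -t 8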

Okay, that will also do. Please run it on that N100 with 4 cores only (in case you have more than 4 cores); it will also help me get a feel for the speed.
About the X4: as far as I have seen in YouTube reviews, they all report heat issues. I think the X4 can't handle its CPU and iGPU within its limited space.