Ahhh…! I’m starting to get the CPU thread thing. I still appreciate 10 t/s on 5B. But since Incognito Man (no offence again) suggests also looking at the speed on the X4, I can’t ignore it.
Question: Radxa X4 vs Rock 5B plus for LLMs
I can’t have more than 4 cores on the N100, it’s a 4-core CPU. The YouTube reviewers failed to mount the cooler correctly, and they had the first batch, which has since been corrected; they’re often quite unscrupulous and don’t correct their videos later, so don’t read too much into this.
soo, are you going to run the model and share the results from your CPU? I’m eagerly waiting for it
I will run it later tonight.
Alright! Sticking with it…
The prompt I used:
“hello”, then
“tell me about the linux command uname.”
Firebat T8 Plus (N100 DDR5-4800):
run 1
prompt eval rate: 65.07 tokens/s
eval rate: 10.64 tokens/s
run 2
prompt eval rate: 59.60 tokens/s
eval rate: 10.66 tokens/s
SOYO M2 Plus (N100 DDR4-2666), lower PL1/PL2:
run 1:
prompt eval rate: 58.16 tokens/s
eval rate: 8.47 tokens/s
run 2:
prompt eval rate: 79.49 tokens/s
eval rate: 8.38 tokens/s
For comparison, if run improperly on the Rock 5B (on all 8 cores):
prompt eval rate: 18.04 tokens/s
eval rate: 4.95 tokens/s
prompt eval rate: 26.43 tokens/s
eval rate: 4.93 tokens/s
So if it was run properly using the 4 big cores, it would get around 7-8 tokens/s, similar to some slower N100 PCs, including the X4.
Thanks a lot for that!! You used ollama, right? So, if I move to llama.cpp or llamafile to run that qwen, I would get 2-3 tokens/sec more. Umm… to me, it seems the X4 and the Rock 5B are nearly the same. I can’t decide which one to order😐 @willy @incognito
The X4 can run both Windows and Linux, but the Rock 5B is Linux-only (with a lower/older kernel version, if I’m not wrong).
If both are the same for your use case, you might as well pick the most flexible one, i.e. the X4, because you can more easily repurpose it. It runs standard distros.
@incognito for many years now you’ve been able to use Docker to pin specific cores with the argument --cpuset-cpus="4-7"
@darkmode I have tried that too, and indeed it limits ollama to 4 physical cores. But I think it still detects an 8-thread CPU (lscpu executed inside the container shows all 8), runs 8 threads on 4 cores, and the performance is piss (I just wrote “hello”, I didn’t have the patience to wait for the output of the other prompt):
total duration: 3m47.083569202s
load duration: 54.765472ms
prompt eval count: 30 token(s)
prompt eval duration: 27.361s
prompt eval rate: 1.10 tokens/s
eval count: 10 token(s)
eval duration: 3m19.654s
eval rate: 0.05 tokens/s
This is all quite stupid.
Amazingly stupid stuff, indeed! I tend to think it’s not using sched_getaffinity() then, which is the correct way to figure out which CPUs it’s bound to and to respect taskset. Thus it might possibly decide to go check /proc/cpuinfo or /sys instead. I don’t know if you can force it to use a given thread count; otherwise you might try to mount a truncated copy of /proc/cpuinfo over /proc/cpuinfo, and even unmount /sys in the docker container to make sure it doesn’t find other ways to discover your CPUs. But at this point, I’d declare the tool utterly broken. I have no idea what it looks like inside, but maybe you should try to contribute an option to force the desired number of threads and get it upstreamed. I guess you’d save all users of this thing on ARM boards, and maybe even those running on modern Intel P+E cores!
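For reference, here’s a rough sketch of what I mean (my own illustration, not ollama’s actual code): asking the kernel for the affinity mask gives you the CPUs the process is actually allowed to use, while a count of online CPUs (what a /proc or /sys based probe sees) keeps reporting all of them:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    /* CPUs this process may actually run on: respects taskset and
       Docker's --cpuset-cpus */
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    /* what an online-CPU count reports, regardless of affinity */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online CPUs: %ld, usable CPUs: %d\n", online, CPU_COUNT(&set));
    return 0;
}

Inside your container with --cpuset-cpus="4-7", this should print 8 online but only 4 usable.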
Yeah, I also have one of these (Lunar Lake) and I am probably losing some perf due to this… I’ll try to ask for it but so far I only saw people wanting more threads and the ollama devs saying that it’s not possible to make it use a specific number.
@willy you have amazing intuition, the fake /proc/cpuinfo worked! With this, it really runs on 4 cores and the performance is:
prompt eval rate: 24.34 tokens/s
eval rate: 4.99 tokens/s
prompt eval rate: 18.29 tokens/s
eval rate: 4.98 tokens/s
prompt eval rate: 6.25 tokens/s (that was without saying "hello" first)
eval rate: 5.05 tokens/s
So just very slightly higher.
Via open-webui, the response tokens/s is higher, around 6.8.
Here is the fake cpuinfo in case someone finds it useful:
processor : 0
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
cpu model : Rockchip RK3588
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x2
CPU part : 0xd0b
CPU revision : 0
processor : 1
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
cpu model : Rockchip RK3588
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x2
CPU part : 0xd0b
CPU revision : 0
processor : 2
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
cpu model : Rockchip RK3588
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x2
CPU part : 0xd0b
CPU revision : 0
processor : 3
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
cpu model : Rockchip RK3588
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x2
CPU part : 0xd0b
CPU revision : 0
Serial : f4b09ed654bd5a24
docker command: docker run --cpuset-cpus="4-7" -d -v ollama:/root/.ollama --mount type=bind,src=/path/to/cpuinfo_fake,dst=/proc/cpuinfo -p 11434:11434 --name ollama ollama/ollama
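And to make it obvious why the trick works, here’s a rough sketch of the kind of naive probe the fake file defeats (my own illustration, not ollama’s actual detection code): anything that simply counts the “processor” entries in /proc/cpuinfo now only sees the 4 from the bind-mounted copy:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* naive CPU detection: count "processor" entries in /proc/cpuinfo */
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    int cpus = 0;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "processor", 9) == 0)
            cpus++;
    fclose(f);
    printf("CPUs according to /proc/cpuinfo: %d\n", cpus);
    return 0;
}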
Still half of what I’m getting on mine. What’s the quantization size of the model? I suspect it’s Q8 if it’s exactly half of the perf I got at IQ4. At least we’re sure it’s not running on the small cores, because for me it runs at 2.3 t/s on the small ones.
This is probably just because llama.cpp is more bleeding edge; I think ollama still doesn’t use NEON etc. (well, it does on Apple Silicon, but ARM64 was always a second-class citizen).
I don’t buy that; llama.cpp’s performance on this machine has not evolved an iota over the last year. Also, here’s the performance I’m getting with various thread counts:
- 1: 3.15 t/s
- 2: 5.93 t/s
- 3: 8.43 t/s
- 4: 10.52 t/s
So as you can see, the extra CPU performance is quickly amortized by RAM speed (perfect scaling from 1 thread would be about 4 × 3.15 ≈ 12.6 t/s, yet 4 threads only reach 10.52), and in your case it sits between my 1- and 2-thread numbers. It cannot be just a few missing optimizations that divide the CPU performance by that much. That’s why I suppose it’s more likely the quantization of the model that plays against performance here.
It does work. But since you wanted to use qwen, I was testing qwen.
I’m not seeing it claim to be faster; it claims to embed llama.cpp in a way that’s portable and easy to use: they make huge cross-platform executables that contain the LLM and that you can run anywhere. The fact that a user got amazing perf by running it on their GPU just means their GPU is fast enough.
Owh… I misunderstood your replies. I thought you wanted more speed in that docker environment.