I’ve conducted some comparative tests to measure the effect of the different DRAM generations. For this I’ve run llama.cpp on the Rock5B, the Rock5 ITX (under roobi), and ADLINK’s AADK based on an Ampere Altra Q80-26 (80 cores at 2.6 GHz). The Altra uses Neoverse-N1 cores, which are essentially the same as the A76. LLMs are interesting because they’re often limited by memory bandwidth during generation. Since my Rock5B has 4GB RAM, I’ve used the Phi-3-mini model (3.8B parameters) quantized at Q6_K (3.1 GB) and a small context of 512 tokens. I’m only using the big cores for this test.
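For context, here is the back-of-envelope model behind the “bandwidth-bound” claim, as a minimal Python sketch. It assumes each generated token streams the full 3.1 GB of weights once, and the bus widths (a single 64-bit interface on both Rock5 boards, 6 x 64-bit on the Altra) are my assumptions; these are theoretical ceilings, and the measured numbers below land well under them.

# Back-of-envelope ceiling on generation speed: every generated token
# has to stream the full quantized weight set from DRAM, so peak DRAM
# bandwidth caps tokens/s at roughly bandwidth / model size.
MODEL_GB = 3.1  # Phi-3-mini Q6_K

# (channels, bus width in bits, MT/s) -- the bus widths are my assumptions
boards = {
    "Rock5B    (LPDDR4X-4224)": (1, 64, 4224),
    "Rock5 ITX (LPDDR5-5472)":  (1, 64, 5472),
    "Altra     (6x DDR4-2933)": (6, 64, 2933),
}

for name, (ch, bits, mts) in boards.items():
    gbps = ch * bits / 8 * mts / 1000  # theoretical peak bandwidth in GB/s
    print(f"{name}: {gbps:6.1f} GB/s -> at most {gbps / MODEL_GB:4.1f} tok/s")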
- The Rock5B has its two big clusters running at 2256 and 2272 MHz respectively (hence 2264 avg). It uses LPDDR4X at 4224 MT/s. It parses at 5.58 tokens/s and produces 4.63 tokens/s.
- The Rock5 ITX has its two big clusters at 2287 and 2223 MHz respectively (2255 avg), and LPDDR5 at 5472 MT/s. It parses at 5.71 tokens/s and produces 4.85 tokens/s, hence 4.7% faster generation for 0.4% lower CPU frequency.
- The Rock5 ITX with only 3 threads instead of 4 drops to 3.80 t/s generation, above the 3.64 t/s that pure CPU scaling would predict (3/4 of 4.85), proving that DRAM bandwidth is already a limiting factor at 4 threads (see the sanity check after this list).
- The Altra limited to 4 threads has its cores running at 2600 MHz and 6 single-DIMM 64-bit DDR4 channels at 2933 MT/s. It parses at 7.19 tokens/s and produces 5.96 tokens/s. Hence it’s respectively 28.8 and 28.7% faster than the 5B for 14.8% higher CPU frequency and 4.17x higher memory bandwidth.
- The Altra’s generation speed peaks around 40 threads, at 22.80 tokens/s, i.e. 3.8 times faster than with 4 threads, 4.92 times faster than the Rock5B and 4.7 times faster than the Rock5 ITX.
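The ratios quoted in this list are easy to re-derive from the raw tokens/s figures; here is a quick sanity-check sketch in Python (measured numbers only, nothing new):

# Re-deriving the comparative figures from the measured generation rates.
rock5b_gen, itx_gen, altra4_gen, altra40_gen = 4.63, 4.85, 5.96, 22.80

# Pure CPU scaling from 4 threads down to 3 would predict 3/4 of the rate;
# the measured 3.80 t/s beats this, so DRAM was already limiting at 4 threads.
print(f"CPU-bound prediction at 3 threads: {itx_gen * 3 / 4:.2f} t/s (measured: 3.80)")

print(f"ITX vs Rock5B generation: +{(itx_gen / rock5b_gen - 1) * 100:.2f}%")
print(f"Altra@40t vs Altra@4t:    x{altra40_gen / altra4_gen:.2f}")
print(f"Altra@40t vs Rock5B:      x{altra40_gen / rock5b_gen:.2f}")
print(f"Altra@40t vs Rock5 ITX:   x{altra40_gen / itx_gen:.2f}")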
This suggests that 6 channels of DDR4-2933 impose a hard ceiling of 22.8 tok/s, which in turn puts an upper bound of 5.47 t/s on the Rock5B’s 4224 MT/s RAM and 7.09 t/s on the Rock5 ITX’s 5472 MT/s. Of course the CPUs are also a limiting factor here, but we’ve shown above that DRAM matters in the 4-thread test. A quick ratio calculation, 4.85/22.8*6*2933 = 3743, shows that the Rock5 ITX delivers as if it were running DDR4 at 3743 MT/s (1872 MHz), and the Rock5B as if it were at just about 1800 MHz, i.e. roughly DDR4-3600 (but again the CPU does count here).
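The same calculation as a small Python sketch, using the Altra’s 22.8 t/s as the reference point for 6 channels of DDR4-2933 and assuming generation speed scales linearly with aggregate MT/s (the exact Rock5B figure comes out at DDR4-3574 / 1787 MHz before rounding):

# Equivalent single-channel DDR4 speed implied by the measured rates, using
# the Altra (6 channels of DDR4-2933 -> 22.8 t/s) as the reference and
# assuming tokens/s scales linearly with aggregate MT/s.
ALTRA_TPS, ALTRA_CHANNELS, ALTRA_MTS = 22.8, 6, 2933

def equivalent_ddr4(tps):
    mts = tps / ALTRA_TPS * ALTRA_CHANNELS * ALTRA_MTS
    return mts, mts / 2  # MT/s, and the corresponding I/O clock in MHz

for board, tps in {"Rock5 ITX": 4.85, "Rock5B": 4.63}.items():
    mts, mhz = equivalent_ddr4(tps)
    print(f"{board}: behaves like single-channel DDR4-{mts:.0f} ({mhz:.0f} MHz)")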
I would genuinely have expected a somewhat higher gain between the 5B and the ITX (maybe 10-15%), but as noted above there’s still the open question of why LPDDR5 isn’t that much faster. That said, it is slightly faster (4.7% more generation for a 0.4% slower CPU clock), just not by much.
Regardless, this remains very good performance and should be sufficient for most use cases. But anything we can find to make LPDDR5 perform significantly better than LPDDR4X would be welcome.
For those interested in reproducing these tests, I’ve used tag b2918 of llama.cpp, with Phi-3-mini-128k-instruct.Q6_K.gguf. The command line and (trimmed) output are:
willy@roobi:~/llama.cpp$ time taskset -c 4-7 ./main -c 512 -s 1 --temp 0.1 -n -1 --threads 4 -m ../models/Phi-3-mini-128k-instruct.Q6_K.gguf -e -p "<|im_start|>system\nYou're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>\n<|im_start|>user\nPlease explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant\n"
(...)
<s> <|im_start|>system
You're a super-smart AI assistant that never writes hallucinations, and you respond to the user's questions accurately.<|im_end|>
<|im_start|>user
Please explain to me what could be the benefits of running an LLM on a low-power processor like a Cortex A76 or a Neoverse-N1.<|im_end|><|im_start|>Assistant
Running a Large Language Model (LLM) on a low-power processor like the Cortex A76 or Neoverse-N1 could have several potential benefits.
1. **Energy Efficiency**: Low-power processors are designed to consume less power, which can lead to significant energy savings. This is particularly beneficial in large-scale deployments where energy consumption can be a major concern.
2. **Cost Savings**: Lower power consumption translates to lower energy costs. This can result in significant cost savings, especially in large-scale deployments.
3. **Environmental Impact**: Lower energy consumption also means a reduced environmental impact. This is particularly important in the context of climate change and the global effort to reduce carbon emissions.
4. **Heat Generation**: Lower power processors generate less heat. This can reduce the need for cooling systems, which can further reduce energy consumption and costs.
5. **Performance**: While it's important to note that low-power processors may not offer the same level of performance as high-power processors, they can still provide adequate performance for many applications.
In conclusion, running an LLM on a low-power processor can offer benefits in terms of energy efficiency, cost savings, environmental impact, heat generation, and adequate performance. However, the specific benefits would depend on the specific requirements and constraints of the application.<|endoftext|> [end of text]
(...)
system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 842.03 ms
llama_print_timings: sample time = 17.60 ms / 342 runs ( 0.05 ms per token, 19431.82 tokens per second)
llama_print_timings: prompt eval time = 18920.53 ms / 108 tokens ( 175.19 ms per token, 5.71 tokens per second)
llama_print_timings: eval time = 70334.79 ms / 341 runs ( 206.26 ms per token, 4.85 tokens per second)
llama_print_timings: total time = 89350.71 ms / 449 tokens