Below are my steps to get DeepSeek-R1-Distill-Qwen-1.5B running on the Orion O6 CPU.
Model Download
Link: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Download the model into a local directory, e.g. DeepSeek-R1-Distill-Qwen-1.5B.
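One possible way to fetch it is with the ModelScope CLI (this is a sketch, assuming the modelscope Python package is installed and its download command/flags behave as I recall; downloading via the web page or git clone works just as well):
pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local_dir DeepSeek-R1-Distill-Qwen-1.5B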
Llama.cpp
Compilation on the x86_64 Ubuntu host
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4641
mkdir build && cd build
cmake -DGGML_LLAMAFILE=OFF ..
make -j
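As a quick sanity check that the build succeeded (optional; I'm assuming the binaries land in build/bin with the default CMake layout, and that llama-cli accepts --version as in recent llama.cpp releases):
./bin/llama-cli --version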
Quantization on the x86_64 Ubuntu host
Format Conversion
pip install -r requirements.txt
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B
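By default the converter writes the GGUF next to the model files. If you want to be explicit about the output precision and file name, something like the following should also work (based on my understanding of the script's --outtype/--outfile options):
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B --outtype f16 --outfile DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf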
Quantization
Quantize to Q4_K_M (I will try other quantization types later):
./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
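To see which other quantization types are available before trying them, running llama-quantize without arguments should print its usage text, including the list of supported types (my understanding of the tool's behavior):
./build/bin/llama-quantize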
Performance on O6
Runtime
The O6 rootfs already has llama.cpp integrated, which is great; otherwise I would have had to run the cross-build again myself.
We just need to copy DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf to the O6 rootfs and execute the following command:
taskset -c 0,5,6,7,8,9,10,11 llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -c 4096 -t 8 -p "Please introduce HongKong in China."
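The taskset mask pins llama-cli to 8 of the 12 cores; I assume those are the faster ones on this SoC. One way to double-check which cores are fastest, via the standard Linux cpufreq sysfs interface, is a small loop like this:
for X in $(seq 0 11); do
  echo -n "cpu$X: "
  cat /sys/devices/system/cpu/cpu$X/cpufreq/cpuinfo_max_freq
done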
Benchmark
To get better performance, I set the CPUs to the performance governor with the command below, where X loops over all CPUs from 0 to 11 (a loop sketch follows the command).
echo performance > /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor
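A minimal loop to apply this to all 12 cores (assuming a root shell; otherwise prefix the write with sudo tee):
for X in $(seq 0 11); do
  echo performance > /sys/devices/system/cpu/cpu$X/cpufreq/scaling_governor
done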
The benchmark command is as follows:
taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -pg 128,128 -t 8