DeepSeek-R1-Distill-Qwen-1.5B on Orion O6 CPU

Below are my steps to get DeepSeek-R1-Distill-Qwen-1.5B running on the Orion O6 CPU.

Model Download

Link: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Download the model and save it to a local directory, e.g. DeepSeek-R1-Distill-Qwen-1.5B
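For completeness, a minimal download sketch using the ModelScope CLI (my addition, not part of the original steps; the flag names are assumptions worth double-checking against your installed modelscope version):

pip install modelscope
# download into a local directory named DeepSeek-R1-Distill-Qwen-1.5B
# (assumed flag names; verify with: modelscope download --help)
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local_dir DeepSeek-R1-Distill-Qwen-1.5B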

Llama.cpp

Compilation on an x86_64 Ubuntu host

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4641
mkdir build && cd build
cmake -DGGML_LLAMAFILE=OFF ..
make -j
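As a quick sanity check (my assumption, not part of the original steps), the freshly built binaries can report their version; llama-cli supports a --version flag in llama.cpp builds of this era:

# still inside llama.cpp/build; prints the build number and commit
./bin/llama-cli --version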

Quantization on an x86_64 Ubuntu host

Format Conversion

cd ..   # back to the llama.cpp repo root, since the previous step ended inside build/
pip install -r requirements.txt
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B
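By default the script writes an F16 GGUF into the model directory. To be explicit about the output type and path, the script also accepts --outtype and --outfile (a sketch; verify against your checkout):

python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B --outtype f16 --outfile DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf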

Quantization

Quantize with Q4_K_M (I will try other quantization types later):

./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
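To see which other quantization types are available for later experiments, running llama-quantize without arguments prints its usage text, including the list of allowed types (Q4_0, Q5_K_M, Q8_0, and so on):

./build/bin/llama-quantize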

Performance on O6

Runtime

The O6 rootfs already has llama.cpp integrated, which is great; otherwise I would have had to cross-build it myself.
We just need to copy DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf to the O6 rootfs and run the following command:

taskset -c 0,5,6,7,8,9,10,11 llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -c 4096 -t 8 -p "Please introduce HongKong in China."
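For more repeatable timing runs, it can help to cap the generation length with llama-cli's standard -n (number of tokens to predict) flag, e.g.:

taskset -c 0,5,6,7,8,9,10,11 llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -c 4096 -t 8 -n 256 -p "Please introduce HongKong in China."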

Benchmark

To get better performance, I set the CPUs to the performance governor with the command below, looping X over all CPUs from 0 to 11 (a loop sketch follows the command):

echo performance > /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor
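A simple one-liner that applies this to all 12 cores (a sketch; the sudo tee form is needed when not running as root, since shell redirection does not inherit sudo privileges):

for X in $(seq 0 11); do echo performance | sudo tee /sys/devices/system/cpu/cpu$X/cpufreq/scaling_governor; done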

The benchmark command is as follows:

taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -pg 128,128 -t 8
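If you prefer to measure prompt processing and token generation separately rather than the combined -pg run, llama-bench also takes independent -p (prompt tokens) and -n (generated tokens) arguments:

taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -p 128 -n 128 -t 8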

Is CIX's Model Hub applicable as well?

Yes, just apply through the CIX EBP program:

support.cixtech.com

Note that the Model Hub has also recently become available on GitHub without needing to register with CIX.