Below are my steps to get DeepSeek-R1-Distill-Qwen-1.5B running on the Orion O6 CPU.
Model Download
Link: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Download the model into a local directory, e.g. DeepSeek-R1-Distill-Qwen-1.5B.
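One possible way to fetch it is with the ModelScope CLI (this is a sketch, assuming the modelscope Python package is installed and its download command/flags behave as I recall; downloading via the web page or git clone works just as well):
pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local_dir DeepSeek-R1-Distill-Qwen-1.5B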
Llama.cpp
Compilation on the x86_64 Ubuntu host
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4641
mkdir build && cd build
cmake -DGGML_LLAMAFILE=OFF ..
make -j
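As a quick sanity check that the build succeeded (optional; I'm assuming the binaries land in build/bin with the default CMake layout, and that llama-cli accepts --version as in recent llama.cpp releases):
./bin/llama-cli --version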
Quantization on the x86_64 Ubuntu host
Format Conversion
pip install -r requirements.txt
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B
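By default the converter writes the GGUF next to the model files. If you want to be explicit about the output precision and file name, something like the following should also work (based on my understanding of the script's --outtype/--outfile options):
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B --outtype f16 --outfile DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf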
Quantization
Quantize to Q4_K_M (I will try other quantization types later):
./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
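To see which other quantization types are available before trying them, running llama-quantize without arguments should print its usage text, including the list of supported types (my understanding of the tool's behavior):
./build/bin/llama-quantize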
Performance on O6
Runtime
The O6 rootfs already has llama.cpp integrated, which is great; otherwise I would have had to run the cross-build again myself.
We just need to copy DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf to the O6 rootfs and execute the following command:
taskset -c 0,5,6,7,8,9,10,11 llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -c 4096 -t 8 -p "Please introduce HongKong in China."
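The taskset mask pins llama-cli to 8 of the 12 cores; I assume those are the faster ones on this SoC. One way to double-check which cores are fastest, via the standard Linux cpufreq sysfs interface, is a small loop like this:
for X in $(seq 0 11); do
  echo -n "cpu$X: "
  cat /sys/devices/system/cpu/cpu$X/cpufreq/cpuinfo_max_freq
done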
Benchmark
To get better performance, I set the CPUs to the performance governor with the command below, where X loops over all CPUs from 0 to 11 (a loop sketch follows the command).
echo performance > /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor
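A minimal loop to apply this to all 12 cores (assuming a root shell; otherwise prefix the write with sudo tee):
for X in $(seq 0 11); do
  echo performance > /sys/devices/system/cpu/cpu$X/cpufreq/scaling_governor
done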
The benchmark command is as follows:
taskset -c 0,5,6,7,8,9,10,11 llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -pg 128,128 -t 8