Llama.cpp - Benchmarks

Hi all, while running llama-bench on larger models (e.g. 32B+ Q4_K_M), I kept getting hard crashes on my Orion O6 (the fan stayed at full pelt and the board became inaccessible, requiring a hard reset).

Turns out this was just my system config that needed tweaking to make sure I wasn’t overcommitting to memory that I didn’t have available.

With the following tweaks, larger models can be benchmarked without issue (they’re obviously slow, though).

sudo sh -c 'echo 2 > /proc/sys/vm/overcommit_memory'
sudo sh -c 'echo 80 > /proc/sys/vm/overcommit_ratio'
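As a rough sanity check on those two values: with vm.overcommit_memory=2 the kernel refuses allocations beyond a commit limit of roughly swap + RAM * overcommit_ratio / 100. The helper below is only a sketch of that arithmetic. The name commit_limit_kb is made up for illustration, and it assumes no huge pages are reserved.

```shell
# Hypothetical helper: estimate the commit limit (in kB) enforced when
# vm.overcommit_memory=2. Assumes no huge pages are reserved.
# Args: MemTotal_kB SwapTotal_kB overcommit_ratio
commit_limit_kb() {
    echo $(( $1 * $3 / 100 + $2 ))
}
```

On the board itself, the inputs can be taken from the MemTotal and SwapTotal lines of /proc/meminfo, and the result compared with the CommitLimit line in the same file. Note that values written to /proc/sys do not survive a reboot.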

I got my O6 working again (BIOS issue, corrupted the SPI Flash) and gave the Cix GO drivers a shot on the default Debian Radxa image. Roughly:

##
# Update system packages
##
apt update
apt upgrade

###
# CIX GO drivers: https://developer.cixtech.com/
###

# Uninstall existing packages
./uninstall.sh

# Install new Cix GO packages
./install.sh

###
# LLAMA
###

# Clone Llama
git clone https://github.com/ggml-org/llama.cpp.git

# Make change to disable grouping feature
# NOTE: I had to make a few other changes in Vulkan file to disable `VK_EXT_layer_settings`
#       These aren't supported on the vulkan-dev package that's available in the Debian repo. I've read that this shouldn't impact performance, but maybe it's the reason why it's slow for me?

# Build Llama.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j 8

# Run benchmark
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-bench -m ../llm/qwen2.5-3b-instruct-q4_0.gguf -pg 128,128 -t 8 -ngl 1000

Unfortunately, I couldn’t get close to the speeds @Robin_binbin got ( Does Vulkan work? - #26 by Robin_binbin ):

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |           pp512 |         18.35 ± 0.04 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |           tg128 |         16.47 ± 0.02 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |     pp128+tg128 |         16.53 ± 0.01 |
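For comparing runs like this one against other people's numbers, it can help to reduce the table to just the test name and the mean t/s. The function below is a sketch that assumes the markdown table layout shown above, with the test name and t/s as the last two columns. bench_tps is a made-up name, not part of llama.cpp.

```shell
# Hypothetical helper: read llama-bench's markdown table on stdin and
# print "<test> <mean t/s>" for each result row.
bench_tps() {
    awk -F'|' '
        # Only table rows (starting with "|") whose t/s cell contains a
        # digit; this skips the header row and the dashed separator row.
        /^\|/ && NF >= 4 && $(NF-1) ~ /[0-9]/ {
            test = $(NF-2)
            gsub(/ /, "", test)        # strip column padding
            split($(NF-1), v, " ")     # v[1] is the mean, before the +/- part
            print test, v[1]
        }'
}
```

Piping the benchmark output through it (for example `./build/bin/llama-bench ... | bench_tps`) gives one line per test that is easy to diff between builds.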

@Robin_binbin Was there anything else you might’ve done? I did notice that vulkaninfo seems to crash for me still:

radxa@orion-o6:~/Projects/llama.cpp$ vulkaninfo
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.239

# OMITTED FOR BREVITY

ERROR: [Loader Message] Code 0 : vkCreateDevice:  Failed to create device chain.
ERROR at ./vulkaninfo/vulkaninfo.h:1362:vkCreateDevice failed with ERROR_INITIALIZATION_FAILED
vulkaninfo: ./vulkaninfo/outputprinter.h:200: Printer::~Printer(): Assertion `!object_stack.empty() && "mismatched number of ObjectStart/ObjectEnd or ArrayStart/ArrayEnd's"' failed.
Aborted

Note that I did have to disable VK_EXT_layer_settings to get it to compile with the Vulkan SDK that comes stock in the Debian repos.

I made the following modification to llama.cpp, which improves the prefill score.

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 3019a545d..53dbf18b8 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3030,29 +3030,23 @@ static vk_device ggml_vk_get_device(size_t idx) {
                 pipeline_robustness = true;
             } else if (strcmp("VK_EXT_subgroup_size_control", properties.extensionName) == 0) {
                 device->subgroup_size_control = true;
-#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT")) {
                 device->coopmat_support = true;
                 device->coopmat_m = 0;
                 device->coopmat_n = 0;
                 device->coopmat_k = 0;
-#endif
-#if defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
             } else if (strcmp("VK_NV_cooperative_matrix2", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT2")) {
                 coopmat2_support = true;
-#endif
 #if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_shader_integer_dot_product", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_INTEGER_DOT_PRODUCT")) {
                 device->integer_dot_product = true;
 #endif
-#if defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_shader_bfloat16", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_BFLOAT16")) {
                 bfloat16_support = true;
-#endif
             }
         }

Regarding vulkaninfo, I note you executed the command:

apt upgrade

This may have upgraded some Vulkan-related libs on your system. As llama.cpp works well with the Mali Vulkan driver, I think you might try pointing the Vulkan ICD at the Mali driver when running vulkaninfo:

export VK_ICD_FILENAMES=/etc/vulkan/icd.d/mali.json
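If exporting the variable for the whole session is not wanted, it can be scoped to a single command instead. The wrapper below is just a sketch. with_icd is a made-up name, and the manifest path is whatever your driver package installed.

```shell
# Hypothetical wrapper: run one command with the Vulkan loader restricted
# to a single ICD manifest, without exporting the variable globally.
# Usage: with_icd /path/to/icd.json command [args...]
with_icd() {
    icd=$1
    shift
    if [ ! -r "$icd" ]; then
        echo "ICD manifest not readable: $icd" >&2
        return 1
    fi
    env VK_ICD_FILENAMES="$icd" "$@"
}
```

For example, `with_icd /etc/vulkan/icd.d/mali.json vulkaninfo` runs vulkaninfo against the Mali manifest only, and fails early if that file is missing.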

Forgot to mention that I have applied Kevin’s patch to llama.cpp, as he described below.

Thanks for the details, appreciate it! And sorry for late reply - have been sick the past week.

I did have another play around with all this yesterday but, unfortunately, am still stuck at very low PP (~10tps for 3B Q4). I’m not too sure why yet, but I did try a few other things:

  1. Built VulkanSDK for the latest glslc, which correctly auto-detects the Vulkan extensions above (GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT, etc.). For context, with the latest Git versions of llama.cpp, I had a bit of trouble building, and there are many #if defined … statements for the extensions in the ggml-vulkan.cpp file.
  2. Made sure to export VK_ICD_FILENAMES=/etc/vulkan/icd.d/mali.json - which DOES fix the vulkaninfo issue.
  3. Probably a lot of other things too - but I can’t recall them all sorry.

When I get some time, I might try again from scratch with the Default Radxa Image… it’s possible, with all my tinkering, I’ve messed something up.

I do think the Cix GO driver itself is working though - using the monitoring tool mentioned in their docs, I was able to see the Mali GPU Cores all go to 100% while Llama was running, so I suspect the issue is probably somewhere in the Llama.cpp version I’m using.

There is a Llama.cpp PR for Mali G720 tuning.

I haven’t tried this yet, but it might improve performance.