Llama.cpp - Benchmarks

jimhamiru · November 25, 2025, 11:18pm

Hi all, while running llama-bench on larger models (e.g. 32B+ Q4_K_M), I kept getting hard crashes on my Orion O6 (the fan stayed at full pelt and the board became inaccessible, requiring a hard reset).

Turns out this was just my system config that needed tweaking to make sure I wasn’t overcommitting to memory that I didn’t have available.

With the following tweaks, larger models can be benched without issue (they’re obviously slow though).

sudo sh -c 'echo 2 > /proc/sys/vm/overcommit_memory'
sudo sh -c 'echo 80 > /proc/sys/vm/overcommit_ratio'
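For context: with overcommit_memory set to 2, the kernel refuses allocations once committed memory would exceed CommitLimit, which is roughly swap plus overcommit_ratio percent of RAM (ignoring huge pages, and assuming vm.overcommit_kbytes is left at 0). A minimal sketch for estimating that limit; the helper name commit_limit_kb is made up for illustration:

```shell
# Estimate the kernel's CommitLimit (kB) for vm.overcommit_memory=2.
# Formula (huge pages ignored): MemTotal * overcommit_ratio / 100 + SwapTotal
# commit_limit_kb is a hypothetical helper name, not a system tool.
commit_limit_kb() {
    mem_kb=$1; swap_kb=$2; ratio=$3
    echo $(( mem_kb * ratio / 100 + swap_kb ))
}

# On a Linux box, feed it the live values and compare with what the kernel reports.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/overcommit_ratio ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    ratio=$(cat /proc/sys/vm/overcommit_ratio)
    echo "estimated CommitLimit: $(commit_limit_kb "$mem_kb" "$swap_kb" "$ratio") kB"
    grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo || true
fi
```

If the model plus context needs more than that limit, the allocation should fail cleanly instead of taking the whole board down. Note that values written with echo do not survive a reboot; a file under /etc/sysctl.d/ is the usual way to persist them.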

jimhamiru · December 29, 2025, 10:02am

I got my O6 working again (BIOS issue, corrupted the SPI Flash) and gave the Cix GO drivers a shot on the default Debian Radxa image. Roughly:

##
# Update system packages
##
apt update
apt upgrade

###
# CIX GO drivers: https://developer.cixtech.com/
###

# Uninstall existing packages
./uninstall.sh

# Install new Cix GO packages
./install.sh

###
# LLAMA
###

# Clone Llama
git clone https://github.com/ggml-org/llama.cpp.git

# Make change to disable grouping feature
# NOTE: I also had to make a few other changes in the Vulkan source file to disable `VK_EXT_layer_settings`.
#       That extension isn't supported by the vulkan-dev package that's available in the Debian repo.
#       I've read that this shouldn't impact performance, but maybe it's the reason why it's slow for me?

# Build Llama.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j 8

# Run benchmark
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-bench -m ../llm/qwen2.5-3b-instruct-q4_0.gguf -pg 128,128 -t 8 -ngl 1000

Unfortunately, I couldn’t get close to the speeds @Robin_binbin got (Does Vulkan work? - #26 by Robin_binbin):

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |           pp512 |         18.35 ± 0.04 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |           tg128 |         16.47 ± 0.02 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | Vulkan     | 1000 |       8 |     pp128+tg128 |         16.53 ± 0.01 |

@Robin_binbin Was there anything else you might’ve done? I did notice that vulkaninfo still seems to crash for me:

radxa@orion-o6:~/Projects/llama.cpp$ vulkaninfo
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.239

# OMITTED FOR BREVITY

ERROR: [Loader Message] Code 0 : vkCreateDevice:  Failed to create device chain.
ERROR at ./vulkaninfo/vulkaninfo.h:1362:vkCreateDevice failed with ERROR_INITIALIZATION_FAILED
vulkaninfo: ./vulkaninfo/outputprinter.h:200: Printer::~Printer(): Assertion `!object_stack.empty() && "mismatched number of ObjectStart/ObjectEnd or ArrayStart/ArrayEnd's"' failed.
Aborted

Note that I did have to disable VK_EXT_layer_settings to get llama.cpp to compile with the Vulkan SDK that ships in the Debian repos.

KevinLi · December 31, 2025, 5:05am

I made the following modification to llama.cpp, which improves the prefill score.

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 3019a545d..53dbf18b8 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3030,29 +3030,23 @@ static vk_device ggml_vk_get_device(size_t idx) {
                 pipeline_robustness = true;
             } else if (strcmp("VK_EXT_subgroup_size_control", properties.extensionName) == 0) {
                 device->subgroup_size_control = true;
-#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT")) {
                 device->coopmat_support = true;
                 device->coopmat_m = 0;
                 device->coopmat_n = 0;
                 device->coopmat_k = 0;
-#endif
-#if defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
             } else if (strcmp("VK_NV_cooperative_matrix2", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT2")) {
                 coopmat2_support = true;
-#endif
 #if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_shader_integer_dot_product", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_INTEGER_DOT_PRODUCT")) {
                 device->integer_dot_product = true;
 #endif
-#if defined(GGML_VULKAN_BFLOAT16_GLSLC_SUPPORT)
             } else if (strcmp("VK_KHR_shader_bfloat16", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_BFLOAT16")) {
                 bfloat16_support = true;
-#endif
             }
         }
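The getenv() checks visible in the diff suggest that each of these features can be switched off at run time through an environment variable. Assuming that holds for the build in use, a rough sketch for comparing runs with one feature disabled at a time. BENCH, MODEL and run_variant are illustrative names, and the llama-bench arguments are simply copied from the command earlier in this thread:

```shell
# Paths are assumptions taken from the earlier post; override as needed.
BENCH=${BENCH:-./build/bin/llama-bench}
MODEL=${MODEL:-../llm/qwen2.5-3b-instruct-q4_0.gguf}

# run_variant (hypothetical helper): run the benchmark with one
# GGML_VK_DISABLE_* variable set, so results can be compared to a plain run.
run_variant() {
    var=$1
    echo "== $var=1 =="
    env "$var=1" "$BENCH" -m "$MODEL" -pg 128,128 -t 8 -ngl 1000
}

# Example (variable names are the ones checked via getenv in the diff):
#   run_variant GGML_VK_DISABLE_COOPMAT
#   run_variant GGML_VK_DISABLE_INTEGER_DOT_PRODUCT
#   run_variant GGML_VK_DISABLE_BFLOAT16
```

If a variant scores the same as the baseline, the corresponding extension was probably not in use to begin with, which the "int dot" and "matrix cores" fields in the ggml_vulkan device line should also indicate.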

Regarding vulkaninfo, I note you executed the command:

apt upgrade

This may have upgraded some Vulkan-related libraries on your system. Since llama.cpp works well with the Mali Vulkan driver, you might try pointing the Vulkan ICD at the Mali driver when running vulkaninfo:

export VK_ICD_FILENAMES=/etc/vulkan/icd.d/mali.json
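A small sketch of how one might check that the manifest exists before pointing the loader at it, and scope the variable to a single command instead of exporting it for the whole session. The use_icd helper name is hypothetical; the path is the one given above:

```shell
# use_icd (hypothetical helper): run a command with the Vulkan loader
# restricted to one ICD manifest, but only if that manifest is readable.
use_icd() {
    icd=$1; shift
    if [ ! -r "$icd" ]; then
        echo "ICD manifest not found: $icd" >&2
        return 1
    fi
    env VK_ICD_FILENAMES="$icd" "$@"
}

# List installed ICD manifests in the usual loader search directories.
ls /etc/vulkan/icd.d /usr/share/vulkan/icd.d 2>/dev/null || true

# Example:
#   use_icd /etc/vulkan/icd.d/mali.json vulkaninfo --summary
```

If more than one manifest shows up in the listing, the loader may be picking a different driver by default, which would be consistent with vulkaninfo failing while llama.cpp still finds the Mali device.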

Robin_binbin · January 5, 2026, 2:37am

Forgot to mention that I applied Kevin’s patch to llama.cpp, as he described.

jimhamiru · January 5, 2026, 11:42pm

Thanks for the details, appreciate it! And sorry for late reply - have been sick the past week.

I did have another play around with all this yesterday but, unfortunately, am still stuck at very low PP (~10 t/s for 3B Q4). I’m not too sure why yet, but I did try a few other things:

Built the Vulkan SDK to get the latest glslc, which correctly auto-detects the Vulkan extensions above (GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT, etc.). For context, with the latest Git versions of llama.cpp I had a bit of trouble building, and there are many #if defined … statements for the extensions in the ggml-vulkan.cpp file.
Made sure to export VK_ICD_FILENAMES=/etc/vulkan/icd.d/mali.json, which DOES fix the vulkaninfo issue.
Probably a lot of other things too, but I can’t recall them all, sorry.

When I get some time, I might try again from scratch with the Default Radxa Image… it’s possible, with all my tinkering, I’ve messed something up.

I do think the Cix GO driver itself is working though - using the monitoring tool mentioned in their docs, I was able to see the Mali GPU Cores all go to 100% while Llama was running, so I suspect the issue is probably somewhere in the Llama.cpp version I’m using.

jimhamiru · February 4, 2026, 4:44am

There is a Llama.cpp PR for Mali G720 tuning.

github.com/ggml-org/llama.cpp

vulkan: add peak performance tuning for ARM Mali GPUs (G720)

master ← Gong-Mi:mali-g720-tuning

opened 04:05PM - 30 Dec 25 UTC

Gong-Mi

+48 -0

### Description
This PR implements specialized performance tuning for ARM Mali G720 GPUs (vendor ID 0x13B5) in the Vulkan backend.

### Changes
- Implement specialized warptile configurations for Mali G720 to optimize compute throughput.
- Force FP32 path by disabling FP16/BF16 based on experimental findings for peak performance on this architecture.
- Limit suballocation block size to 256MB to improve memory stability on mobile devices within the Termux environment.

### Performance Findings (Termux Native Build)
- **1B Models (e.g., Llama 3.2 1B)**: Show significant performance advantages when fully offloaded to the GPU.
- **4B/8B Models**: Require partial CPU offloading to manage memory constraints effectively on typical mobile RAM configurations.
- **Benchmark Command**: `llama-bench -m models/llama-3.2-1b.gguf -p 512 -n 128 -t 4`

These optimizations aim to provide a more usable experience for Android users running LLMs locally via Vulkan on modern Dimensity/Mali-based SoCs.

I haven’t tried this yet, but it might improve performance.