With the userspace stack from the Khadas images and the performance governor, but without taskset, I get exactly the same result as CNX Software: just a bit above 4000. Running gnome-shell and glmark2-es2-wayland under taskset -c 4-7 improves the result by about 10%, and the rest of the gain comes from switching the compositor to a newer mutter from Debian sid.
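For anyone who wants to reproduce the pinning step, here is a rough sketch of what I mean, not the exact commands I ran. The helper names (`parse_cpu_list`, `pin_running_process`, `run_pinned`) are made up for illustration, and it assumes a Linux system with `pgrep` and `taskset` installed and gnome-shell already running. The `4-7` list is the one mentioned above.

```python
import os
import subprocess


def parse_cpu_list(spec):
    """Expand a taskset-style CPU list such as "4-7" or "0,2,4-5" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_running_process(name, cpus):
    """Pin already-running processes called `name` to `cpus` (Linux only).

    Like `taskset -cp` without `-a`, this changes only the task whose id
    equals the PID (the main thread); threads that already exist keep their
    current affinity.
    """
    out = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    pids = [int(p) for p in out.stdout.split()]
    for pid in pids:
        os.sched_setaffinity(pid, cpus)
    return pids


def run_pinned(argv, cpus):
    """Launch a command under taskset so it starts restricted to `cpus`."""
    cpu_arg = ",".join(str(c) for c in sorted(cpus))
    return subprocess.run(["taskset", "-c", cpu_arg] + list(argv))


if __name__ == "__main__":
    cores = parse_cpu_list("4-7")             # CPU list taken from the comment above
    pin_running_process("gnome-shell", cores)  # compositor is assumed to be running
    run_pinned(["glmark2-es2-wayland"], cores)
```

Pinning the compositor as well as the benchmark matters here because both are in the rendering path; pinning only glmark2 would leave gnome-shell free to be scheduled on the other cores.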
As for the result you get with the Debian image, much of that is because the Mali blob is not well integrated with Xorg. There seem to be some hacks in the Wayland blob to improve performance beyond what should be possible given the kernel limitations, possibly at the cost of latency.
I can try to look into why ArmNN doesn’t work as well as you expect it to…