NPU Reverse engineering

I started reverse engineering the NPU and have a simple matrix multiplication working. If it is of interest, there is a short write-up about the NPU and its capabilities/limitations here.


Did you manage to benchmark the actual TOPS/GOPS? I think 2 TOPS per core may be optimistic, but maybe that figure is for 4-bit, utilising a large portion of the SRAM allocation in
I am expecting the real-world figure to be much smaller.
I didn't even realise it could address 4 GB; I thought it was far more limited.
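For anyone wanting to check the headline figure, here is a minimal sketch of how effective throughput could be derived from a timed matrix multiplication. It assumes the usual convention of counting each multiply-accumulate as 2 operations. The names `matmul_ops`, `effective_gops`, `bench` and `naive_matmul` are made up for illustration, and the pure-Python kernel is only a stand-in for whatever call actually submits the job to the NPU.

```python
import time


def matmul_ops(m, n, k):
    """Operations in an (m x k) @ (k x n) product, counting each MAC as 2 ops."""
    return 2 * m * n * k


def effective_gops(m, n, k, seconds):
    """Effective throughput in GOPS for one matmul that took `seconds`."""
    return matmul_ops(m, n, k) / seconds / 1e9


def bench(run_matmul, m, n, k, repeats=10):
    """Time a zero-argument callable and return its average effective GOPS."""
    run_matmul()  # warm-up run, excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        run_matmul()
    elapsed = (time.perf_counter() - start) / repeats
    return effective_gops(m, n, k, elapsed)


def naive_matmul(a, b):
    """Stand-in kernel (assumption): replace with the real NPU submission."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    m = n = k = 64
    a = [[1] * k for _ in range(m)]
    b = [[1] * n for _ in range(k)]
    print(f"{bench(lambda: naive_matmul(a, b), m, n, k):.4f} GOPS")
```

Dividing the result by 1000 gives TOPS, which can then be compared against the advertised per-core number at the same data type.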

ArmNN is pretty interesting, as with the TF delegate you can switch from CPU to GPU, where the GPU seems to reach approximately 75% of what the matmul vector instructions of the A76 can do.
This might be hampered by the OpenCL interface that ArmNN uses.
The Mali-G710 (the G610 is just the <= 6 core variant) is quite impressive. NPUs/TPUs all seem to be the same in advertising big ops numbers, but the reality when applied to real models is often much less.
Also, the frameworks are often standalone and lack any effective API, and so far I have not been overly impressed by any TPU/NPU that I have seen.

I would love it if Rockchip did what Qualcomm did, with a new, bigger RK3588 of just 8x big cores and a Mali-G710 with a decent amount of cores (my Pixel 6a is MP20…)
Both share Arm's unified memory, and that could be such a sweet spot for running edge LLMs with frameworks that do have APIs, such as llama.cpp, which is implementing Vulkan.
NCNN also implements Vulkan, and maybe if Arm and Panthor roll out I will also be able to test the RK3588 CPU vs GPU.

It's great to have an NPU that can run small CNN YOLO-type models, but that is a considerable distance from running LLMs. The CPU does a great job running some of the smaller llama.cpp quants, and if Vulkan becomes a thing then the GPU should be able to add to that.
I would love to see a SoC with a bigger core count on both the big CPU cores and the GPU, as the sweet spot for many LLMs is not that far above the current RK3588.
