OpenAI Whisper ASR

I have been thinking of trying to get the NPU & GPU into play with ASR, but got sidetracked with a CPU-based port of OpenAI's Whisper. It's really amazing that it runs on the CPU at all, but it does, thanks to the great whisper.cpp repo.

I thought I would post as I was more than happy with the results against a Pi4.
My Rock5b

rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   313.91 ms
whisper_print_timings:      mel time =   107.60 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:   decode time =   657.71 ms / 109.62 ms per layer
whisper_print_timings:    total time =  7256.87 ms

Pi4

pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms

The Rock5b is 5.137 times faster than a Pi4, and I haven't even got round to using the NPU/GPU yet as I'm still reading up on rknn-toolkit2; whatever the case, the above seems to favor the RK3588 when it comes to the CPU.
Yeah, I am cheating slightly as I'm loading from NVMe, but you can see the load time still doesn't have that much effect.
I know the above has been optimised for the Armv8.2 architecture, presumably because of the new Macs, so the x3 perf over a Pi4 might be selling things short.
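For what it's worth, the 5.137x figure is just the ratio of the two total times in the logs above (and remember the Rock5b run used 8 threads to the Pi4's 4); a quick Python check:

# Quick check of the speed-up figures quoted above, numbers copied from the two logs.
pi4_total_ms    = 37281.19
rock5b_total_ms = 7256.87
print(round(pi4_total_ms / rock5b_total_ms, 3))    # -> 5.137

pi4_encode_ms    = 33790.07
rock5b_encode_ms = 6165.18
print(round(pi4_encode_ms / rock5b_encode_ms, 2))  # -> 5.48 on the encode stage alone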


Any updates on NPU use? This could get super interesting, especially with LLaMA!

Would be interested in the NPU usage for this too.

Have you tried llama.cpp on the Rock5b?

I tried llama.cpp and alpaca.cpp. They work but need a lot of compute power (especially since I run them on the CPU). I don't know how to use the NPU for them, or if it's even possible.

https://github.com/ggerganov/llama.cpp is amazingly optimised, and by the repo's own focus it is an 'Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks'.
It can use the Mali-G610 via OpenCL, but that is slower than the CPU; it could still garner improvements, as I think the CSF driver goes into Linux 6.4: https://www.collabora.com/news-and-blog/news-and-events/pancsf-a-new-drm-driver-for-mali-csf-based-gpus.html
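If anyone wants to drive llama.cpp on the Rock5b CPU from Python rather than the CLI, the llama-cpp-python bindings are one option. A minimal sketch, assuming you already have a quantised GGML model on disk (the model path below is just a placeholder):

# Minimal llama.cpp-on-CPU sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder - point it at whatever quantised GGML model you actually have.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_threads=8)  # RK3588 has 8 cores

out = llm("Q: What does ASR stand for? A:", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])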

The NPU, like many, has its own framework and would require RKNPU-specific code; Linux would really benefit from something closer to Android's NNAPI.
Also, it's not a single 6 TOPS unit: it's a 3-core, 2-TOPS-per-core NPU, and even that rating assumes use of the small reserved SRAM area: https://github.com/rockchip-linux/rknpu2/blob/master/doc/RK3588_NPU_SRAM_usage.md
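For anyone curious what the rknn-toolkit2 flow looks like, it is roughly: export to ONNX, convert to .rknn on a PC, then run the .rknn with the rknpu2 runtime on the board. A rough conversion sketch, assuming you have some ONNX export to feed it (the file names and quantisation dataset are placeholders; I have not pushed Whisper itself through this):

# Rough rknn-toolkit2 conversion sketch (this step runs on an x86 host, not on the board).
# File names and the quantisation dataset are placeholders, not a working Whisper conversion.
from rknn.api import RKNN

rknn = RKNN(verbose=True)
rknn.config(target_platform='rk3588')                    # build for the RK3588 NPU
rknn.load_onnx(model='encoder.onnx')                     # some ONNX export of the model to try
rknn.build(do_quantization=True, dataset='dataset.txt')  # text file listing sample inputs for quantisation
rknn.export_rknn('encoder.rknn')                         # the .rknn is what rknpu2 loads on the board
rknn.release()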

LLaMA is not going to run on the RK3588 NPU, and the model zoo is indicative of the model-size limit, even though newer, better and faster models are coming out daily.
The RK3588, being Armv8.2, absolutely smashes the Pi4 for ML, so llama.cpp is usable and in benches punches well above its weight.
I have never been a fan of Apple bling, but the M1/M2 Mini is truly awesome for its 6.8-watt idle, where it can run ultra fast in a client/server race-till-idle setup to give very little perceived latency.
You could do the same with an RK3588 and cover multiple zones, as the diversification of use would clash infrequently and just mean a longer latency (wait).
ASR -> LLM -> TTS is sequential, and much of it is RAM, as the load isn't concurrent, so I guess you could also partition the stages to specific devices.
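Just to illustrate the sequential point: the whole chain is three blocking calls, so each stage could sit on its own board and you only ever pay for the stage that is currently running. A hand-wavy sketch with made-up hostnames and endpoints (nothing here is a real API):

# Hand-wavy sketch of partitioning ASR -> LLM -> TTS across boards.
# Hostnames and endpoints are made up purely for illustration.
import requests

def transcribe(wav_bytes):                     # e.g. whisper.cpp behind a small HTTP wrapper
    return requests.post("http://asr-box:8080/transcribe", data=wav_bytes).text

def generate(prompt):                          # e.g. llama.cpp behind another wrapper
    return requests.post("http://llm-box:8081/generate", json={"prompt": prompt}).text

def speak(text):                               # whatever TTS you like
    return requests.post("http://tts-box:8082/speak", json={"text": text}).content

with open("samples/jfk.wav", "rb") as f:
    reply_audio = speak(generate(transcribe(f.read())))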

Also, I keep looking for an alternative to Whisper, as it's this great all-in-one multi-lang ASR, but the smaller models rocket in WER, as do quite a few of the languages it 'supports'.
It's the Large model WER that is state-of-the-art, but when you get to Small or below with certain languages there are better alternatives that just lack the focus Whisper received and don't have amazing repos like GGML behind them.

Compared to the CPU & GPU, the NPU is actually much less than it first seems, but the CPU for the wattage is godlike.
