OpenAI Whisper ASR

I have been thinking of trying to get the NPU & GPU into play with ASR, but got sidetracked by whisper.cpp, a CPU-based implementation of OpenAI's Whisper. It's amazing that it runs on CPU at all, but it does, thanks to this great repo.
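For anyone wanting to reproduce this, the steps are roughly as follows; a minimal sketch based on the whisper.cpp README at the time, so script names and paths may have moved since:

# clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# download the base English model (the ~140 MB ggml file used below)
bash ./models/download-ggml-model.sh base.en

# transcribe the bundled JFK sample; -t sets the thread count (8 on the Rock5b, 4 on the Pi4)
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8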

I thought I would post as I was more than happy with the results compared to a Pi4.
My Rock5b:

rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   313.91 ms
whisper_print_timings:      mel time =   107.60 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:   decode time =   657.71 ms / 109.62 ms per layer
whisper_print_timings:    total time =  7256.87 ms

Pi4:

pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms

The Rock5b is about 5.1x faster than a Pi4 on total time, and I haven't even got round to using the NPU/GPU yet as I am still reading up on rknn-toolkit2; whatever happens there, the above already favours the RK3588 on CPU alone.
Yes, I am cheating slightly by loading from NVMe, but you can see the load time (313.91 ms vs 1851.33 ms) is only a small part of the total anyway.
I know whisper.cpp has been optimised for the ARMv8.2 architecture, presumably because of the new Macs, and the Pi4's Cortex-A72 doesn't have those extensions, so the commonly quoted ~3x performance advantage over a Pi4 might be selling things short here.
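For the record, the 5.1x figure is just the ratio of the two total times printed above:

# Pi4 total time divided by Rock5b total time
echo "scale=3; 37281.19 / 7256.87" | bc
# -> 5.137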

Any updates on NPU use? This could get super interesting, especially with Llama!