I have been thinking of trying to get the NPU & GPU into play with ASR, but got sidetracked by a CPU-based ASR library: whisper.cpp, a port of OpenAI's Whisper. It's amazing that it runs on the CPU, but it does, thanks to this great repo.
I thought I would post, as I was more than happy with the results against a Pi4.
My Rock5b
rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 313.91 ms
whisper_print_timings: mel time = 107.60 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6165.18 ms / 1027.53 ms per layer
whisper_print_timings: decode time = 657.71 ms / 109.62 ms per layer
whisper_print_timings: total time = 7256.87 ms
Pi4
pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1851.33 ms
whisper_print_timings: mel time = 270.67 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings: decode time = 1287.69 ms / 214.61 ms per layer
whisper_print_timings: total time = 37281.19 ms
That makes the Rock5b 5.137 times faster than a Pi4, and I haven't even got round to using the NPU/GPU yet, as I'm still reading up on rknn-toolkit2. Whatever the case, the results above seem to favor the RK3588 on CPU alone.
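The 5.137× figure falls straight out of the timing lines above; here is a quick per-phase breakdown, with the numbers copied from the two logs (bearing in mind the Rock5b run used 8 threads and the Pi4 run used 4):

```python
# Per-phase speedup of the Rock5b over the Pi4, taken from the
# whisper_print_timings output in the two runs above (milliseconds).
rock5b_ms = {"load": 313.91, "mel": 107.60, "encode": 6165.18,
             "decode": 657.71, "total": 7256.87}
pi4_ms = {"load": 1851.33, "mel": 270.67, "encode": 33790.07,
          "decode": 1287.69, "total": 37281.19}

for phase in rock5b_ms:
    print(f"{phase:>7}: {pi4_ms[phase] / rock5b_ms[phase]:.3f}x")
```

The encode phase, which dominates the runtime, comes out at roughly 5.5×, while decode only gains about 2×.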
Yes, I am cheating slightly by loading from NVMe, but you can see the load time still doesn't have that much effect on the total.
I know the above has been optimised for the ARMv8.2 architecture, presumably because of the new Macs, so the ×3 CPU-performance figure usually quoted over a Pi4 might be selling things short.
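One way to see why ARMv8.2 matters here: on arm64 Linux the kernel exposes the half-precision FP16 extensions as hwcap flags in /proc/cpuinfo (`fphp` and `asimdhp`), which an ARMv8.0 core like the Pi4's Cortex-A72 lacks and an ARMv8.2 core like the RK3588's Cortex-A76 has. A minimal sketch of checking for them; the sample feature strings below are illustrative assumptions, not dumped from either board:

```python
# Check an arm64 /proc/cpuinfo "Features" line for the ARMv8.2
# half-precision extensions that an FP16-optimised NEON path can use.
FP16_FLAGS = {"fphp", "asimdhp"}  # scalar and Advanced SIMD FP16 support

def has_fp16(features_line: str) -> bool:
    """True if the space-separated feature list contains both FP16 flags."""
    return FP16_FLAGS.issubset(features_line.split())

# Illustrative feature strings (assumed, not real board output):
a72 = "fp asimd evtstrm crc32 cpuid"                  # Pi4-class core, ARMv8.0
a76 = "fp asimd aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp"  # RK3588-class core

print(has_fp16(a72))  # False
print(has_fp16(a76))  # True
```

On a real board you would feed in the `Features` line from `/proc/cpuinfo` instead of the sample strings.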