Ubuntu 20.02 & Mali Drivers

Have you managed to add the fixes to ppa:liujianfeng1994/panfork-mesa? I would love to get an image running with Wayland and Mutter.
I’m not so bothered about the 10% from the affinity, as the cnx example doesn’t seem to do so, and often it’s a case of robbing Peter to pay Paul: somewhere along the line something is going to lose out to the normal load balancing of the scheduler.

Yeah, you can give ArmNN a go, as it’s probably a really good bench as benches go, with a finite model and the ability to flick between GpuAcc & CpuAcc on the CLI.
It’s at https://developer.arm.com/documentation/102603/2108/Device-specific-installation/Install-on-Odroid-N2-Plus
Also, like what seems to be all Arm software docs, it’s slightly out of date and you need to debug as you go, but you are obviously more than capable.

Use

sudo apt-get install -y python3-pyarmnn armnn-latest-all

rather than

sudo apt-get install -y python3-pyarmnn libarmnn-latest-all

and

python3 run_audio_file.py --audio_file_path tests/testdata/quick_brown_fox_16000khz.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends CpuAcc CpuRef

rather than

python3 run_audio_file.py --audio_file_path tests/testdata/quick_brown_fox_16000khz.wav --model_file_path tflite_int8/wav2letter_int8.tflite --labels_file_path tests/testdata/wav2letter_labels.txt --preferred_backends CpuAcc CpuRef

as the labels are in the Python code.

I don’t think it will matter, as linking to the X11 driver libmali-valhall-g610-g6p0-x11 just works badly, as X11 does, but I presume that if libmali-valhall-g610-g6p0-wayland or libmali-valhall-g610-g6p0-x11-wayland was linked to the ICD loader instead then, like Wayland overall, results would be better.
I am completely lost with the Armbian image: trying to work backwards to what is actually driving DRI, with everything that is installed in the base image, is confusing.
It would be really interesting to be able to link against the Wayland driver rather than the X11 one.


That ArmNN sample seems to spend a lot of time doing preprocessing of audio data. It’s written in Python, which isn’t exactly the fastest language in the world, but it also does a number of things which cause significant slowdown: doing element-wise calculation rather than doing operations on whole arrays at once, and recalculating constant data.
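
As a rough illustration of those two problems (this is not the sample's actual code, and not the patch either), the sketch below applies a Hann window to one frame element by element while rebuilding the window on every call, then does the same thing with a precomputed window and a single whole-array multiply. The function names and the frame length are made up for the example.

```python
import math

import numpy as np

FRAME_LEN = 512  # illustrative frame length


def window_frame_slow(frame):
    """Element-wise loop that also recomputes the constant window on every call."""
    n = len(frame)
    out = [0.0] * n
    for i in range(n):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)  # constant, yet recomputed each time
        out[i] = frame[i] * w
    return np.array(out)


# Constant data computed once, outside the per-frame path.
_HANN = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(FRAME_LEN) / FRAME_LEN)


def window_frame_fast(frame):
    """One whole-array multiply against the cached window."""
    return frame * _HANN


frame = np.random.default_rng(0).standard_normal(FRAME_LEN)
print(np.allclose(window_frame_slow(frame), window_frame_fast(frame)))
```

Both functions return the same values; the second simply moves the loop into NumPy and stops redoing work that never changes.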

Fixing those significantly improves speed compared to realtime on a five-minute test file:

          CpuAcc   GpuAcc
Before    3.3x     3.1x
After     26.2x    17.7x

The GPU backend has some constant overhead, so testing again with a one-hour file to be more fair:

          CpuAcc   GpuAcc
After     28.1x    25.6x

(I’m not waiting forty minutes to test the hour-long file with the old code.)
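
One way to read the two GpuAcc "After" figures is to assume that the run time is a fixed start-up cost plus a term proportional to the length of the audio. That model is an assumption rather than something measured directly, but fitting it to the five-minute and one-hour results gives a feel for the size of the overhead:

```python
def fit_overhead(len_a, speed_a, len_b, speed_b):
    """Fit runtime = overhead + length / rate to two (length, realtime-factor) points.

    Lengths are in seconds of audio; speeds are "times realtime".
    Returns (overhead_seconds, steady_state_rate).
    """
    t_a = len_a / speed_a  # measured runtime of the short file
    t_b = len_b / speed_b  # measured runtime of the long file
    rate = (len_b - len_a) / (t_b - t_a)
    overhead = t_a - len_a / rate
    return overhead, rate


# GpuAcc "After" figures: 17.7x on five minutes, 25.6x on one hour.
overhead, rate = fit_overhead(300, 17.7, 3600, 25.6)
print(f"overhead ~ {overhead:.1f} s, steady state ~ {rate:.1f}x realtime")
```

Under that assumption the fixed cost comes out at somewhere between five and six seconds, which is a large share of a run on a five-minute file and a small one on an hour-long file.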

There is also a C++ sample which is similar, but it also does a number of things that hurt performance: implementing a “fast” Fourier transform naïvely, and doing too high quality resampling.
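
To show the kind of gap involved (in Python rather than the sample's C++, and not taken from the patch), here is a direct evaluation of the DFT definition next to NumPy's FFT. Both give the same spectrum; only the amount of work differs.

```python
import numpy as np


def naive_dft(x):
    """Direct evaluation of the DFT definition: O(N^2) multiply-adds."""
    n = len(x)
    k = np.arange(n)
    # Full N x N matrix of twiddle factors exp(-2*pi*i*j*k/N).
    twiddles = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return twiddles @ x


x = np.random.default_rng(1).standard_normal(512)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # same result from the O(N log N) routine
```

For a 512-point transform the direct form needs 512 x 512 = 262,144 complex multiplies, while a real FFT needs on the order of N log2 N, or a few thousand operations.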

Again, big speed improvements from fixing those:

One-minute file:

          CpuAcc   GpuAcc
Before    1.5x     1.3x
After     32.3x    8.7x

One-hour file:

          CpuAcc   GpuAcc
After     40.9x    35.5x

However, both CPU and GPU performance could still be improved—it appears that only four cores are used for CPU processing, and the GPU load is only 60% even after my tweaks. This could be fixed by processing multiple blocks of audio at the same time.
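
A minimal sketch of what "multiple blocks at the same time" could look like at the Python level, using a thread pool. `run_block` is a hypothetical stand-in for one inference call, not an ArmNN API, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def run_block(block):
    """Hypothetical stand-in for one inference call on a block of features."""
    return float(np.sum(block * block))


def run_blocks_concurrently(blocks, workers=2):
    """Keep several blocks in flight at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_block, blocks))


blocks = [np.full(4, float(i)) for i in range(8)]
results = run_blocks_concurrently(blocks)
print(results)
```

Whether this helps in practice depends on the inference call releasing the GIL and being safe to use from several threads, which would need checking for ArmNN.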

I haven’t yet investigated why the GPU backend is not faster than the CPU—maybe the model is just too small for the GPU to show much benefit?

Here are the patches I used:

armnn-patches.zip (2.5 KB)


The GPU backend has about a 5 second load time that the CPU doesn’t.

Plus, yeah, Python isn’t the fastest and it really sucks when iterating over DSP-like data.
I hacked the code so the model stays in memory and runs twice, and that takes off quite a bit: the 5 second delay, which seems constant.

The code and model are actually pretty terrible, but this was purely for testing ArmNN. I should have done the same with the MFCC: load once, run twice.

# Copyright © 2021 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT

"""Automatic speech recognition with PyArmNN demo for processing audio clips to text."""

import sys
import os
import numpy as np
import psutil
script_dir = os.path.dirname(__file__)
sys.path.insert(1, os.path.join(script_dir, '..', 'common'))

from argparse import ArgumentParser
from network_executor import ArmnnNetworkExecutor
from utils import prepare_input_data
from audio_capture import AudioCaptureParams, capture_audio
from audio_utils import decode_text, display_text
from wav2letter_mfcc import Wav2LetterMFCC, W2LAudioPreprocessor
from mfcc import MFCCParams
from datetime import datetime

# Model Specific Labels
labels = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm',
          13: 'n',
          14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',
          25: 'z',
          26: "'", 27: ' ', 28: '$'}

def parse_args():
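    parser = ArgumentParser(description="ASR with PyArmNN")
    parser.add_argument("--audio_file_path", required=True, type=str,
                        help="Path to the audio file to perform ASR")
    parser.add_argument("--model_file_path", required=True, type=str,
                        help="Path to ASR model to use")
    parser.add_argument("--preferred_backends", type=str, nargs="+",
                        default=["GpuAcc", "CpuAcc", "CpuRef"],
                        help="""List of backends in order of preference for optimizing
        subgraphs, falling back to the next backend in the list on unsupported
        layers. Defaults to [GpuAcc, CpuAcc, CpuRef]""")
    return parser.parse_args()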

def main(args, network):
    # Read command line args
    audio_file = args.audio_file_path
    print(datetime.now() - starttime, psutil.cpu_percent())

    print(datetime.now() - starttime, psutil.cpu_percent())
    # Specify model specific audio data requirements
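    # mono=True is an assumption (the line was truncated); it matches the upstream ArmNN sample.
    audio_capture_params = AudioCaptureParams(dtype=np.float32, overlap=31712, min_samples=47712, sampling_freq=16000,
                                              mono=True)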

    buffer = capture_audio(audio_file, audio_capture_params)
    print(datetime.now() - starttime, psutil.cpu_percent())
    # Extract features and create the preprocessor

    mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128, mel_lo_freq=0, mel_hi_freq=8000,
                             num_mfcc_feats=13, frame_len=512, use_htk_method=False, n_fft=512)

    print(datetime.now() - starttime, psutil.cpu_percent())
    wmfcc = Wav2LetterMFCC(mfcc_params)
    preprocessor = W2LAudioPreprocessor(wmfcc, model_input_size=296, stride=160)
    current_r_context = ""
    is_first_window = True

    print("Processing Audio Frames...")
    for audio_data in buffer:
        # Prepare the input Tensors
        input_data = prepare_input_data(audio_data, network.get_data_type(), network.get_input_quantization_scale(0),
                                        network.get_input_quantization_offset(0), preprocessor)

        # Run inference
        output_result = network.run([input_data])

        # Slice and Decode the text, and store the right context
        current_r_context, text = decode_text(is_first_window, labels, output_result)

        is_first_window = False

        print(datetime.now() - starttime, psutil.cpu_percent())

    print(current_r_context, flush=True)
    print(datetime.now() - starttime, psutil.cpu_percent())
    print("Inference End", psutil.cpu_percent())

if __name__ == "__main__":
    args = parse_args()
    print("Inference Start", psutil.cpu_percent())
    starttime = datetime.now()
    # Create the ArmNN inference runner
    network = ArmnnNetworkExecutor(args.model_file_path, args.preferred_backends)
    print(datetime.now() - starttime, psutil.cpu_percent())
    main(args, network)
    starttime = datetime.now()
    print(datetime.now() - starttime, psutil.cpu_percent())
    main(args, network)

I was using these, as they are longer but not too long, from https://github.com/ggerganov/whisper.cpp, which is a CPU version of OpenAI’s Whisper based on his own tensor lib, which is interesting.

# Audio samples

# download a few audio samples into folder "./samples":
.PHONY: samples
samples:
	@echo "Downloading samples..."
	@mkdir -p samples
	@wget --quiet --show-progress -O samples/gb0.ogg https://upload.wikimedia.org/wikipedia/commons/2/22/George_W._Bush%27s_weekly_radio_address_%28November_1%2C_2008%29.oga
	@wget --quiet --show-progress -O samples/gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
	@wget --quiet --show-progress -O samples/hp0.ogg https://upload.wikimedia.org/wikipedia/en/d/d4/En.henryfphillips.ogg
	@wget --quiet --show-progress -O samples/mm1.wav https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav
	@echo "Converting to 16-bit WAV ..."
	@ffmpeg -loglevel -0 -y -i samples/gb0.ogg -ar 16000 -ac 1 -c:a pcm_s16le samples/gb0.wav
	@ffmpeg -loglevel -0 -y -i samples/gb1.ogg -ar 16000 -ac 1 -c:a pcm_s16le samples/gb1.wav
	@ffmpeg -loglevel -0 -y -i samples/hp0.ogg -ar 16000 -ac 1 -c:a pcm_s16le samples/hp0.wav
	@ffmpeg -loglevel -0 -y -i samples/mm1.wav -ar 16000 -ac 1 -c:a pcm_s16le samples/mm0.wav
	@rm samples/mm1.wav

I will give them another go, even though the model we have is pretty bad. https://github.com/breizhn/DTLN might be better, as it is also two models, and it would be really interesting to use GpuAcc on one and CpuAcc on the other. But from the load it looks almost like GpuAcc is more of a helper than something that removes most of the load from the CPU, as CUDA does.
You should also be able to take a single big model, partition the layers and run it with two delegates, which is what I was wondering about.
It’s a shame RKNN didn’t go the delegate route.

I was lazy: it chunks the audio through the MFCC, and it is very possible to just convert the whole audio and chunk the premade MFCC through the model, but I didn’t bother.
I may do, as the interest is purely in the delegate and GPU vs CPU, not some horrid MFCC.

It’s strange, as TensorFlow has MFCC ops and you can subclass them into a model, as https://github.com/google-research/google-research/tree/master/kws_streaming does, and even librosa is more performant than the ArmNN example.
Thanks though, as now it makes so much sense why the CPU is being hit.

As is, with samples/hp0.wav (each result is headed by the --preferred_backends list it was run with):

GpuAcc CpuAcc CpuRef
0:01:29.798073 0.0
Inference End 0.0
CpuAcc CpuRef
0:01:33.897910 0.0
Inference End 0.0

sudo apt-get install irqbalance

I did notice the IRQ affinity is nearly all on core0, and irqbalance does swap it to core4 (big), but it doesn’t make a lot of difference:

GpuAcc CpuAcc CpuRef
0:01:29.660175 0.0
Inference End 0.0
CpuAcc CpuRef
0:01:31.325590 50.0
Inference End 0.0
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy6/scaling_governor
GpuAcc CpuAcc CpuRef
0:01:24.576257 0.0
Inference End 0.0
CpuAcc CpuRef
0:01:29.003342 0.0
Inference End 0.0
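
The governor can also be set for every cpufreq policy in one go, so that no cluster gets missed. A small sketch, assuming the usual sysfs layout (`policy*/scaling_governor` under `/sys/devices/system/cpu/cpufreq`) and root privileges; the function name is made up:

```python
from pathlib import Path

CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpufreq")


def set_governor(governor, base=CPUFREQ_DIR):
    """Write `governor` to every policy*/scaling_governor under `base`.

    Returns the names of the policies that were updated.
    """
    updated = []
    for policy in sorted(base.glob("policy*")):
        target = policy / "scaling_governor"
        if target.exists():
            target.write_text(governor + "\n")
            updated.append(policy.name)
    return updated


# On a real board, as root:
#   set_governor("performance")
```

Passing a different `base` makes it easy to try the function against a scratch directory before pointing it at sysfs.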

I will stop being lazy and do a horrid hack to pull out the MFCC and preprocess it.
Nope, I dunno what or where utils is, and the internet is playing up, so I will be lazy and try something better.

From dumping the GPU shader assembly it does appear to be using the IDP instructions for 8-bit dot product, so there isn’t anything obviously wrong there.

This is I think the theoretical performance of 8-bit operations, in multiplications per nanosecond:

A55: 115 (four cores)
A76: 307 (four cores)
GPU: 1024
NPU: 3072

So the GPU should be faster than the CPU, but I guess that a larger model is needed to actually see that.
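
Putting those figures side by side (simple arithmetic on the numbers listed above, nothing measured):

```python
# Theoretical 8-bit multiplications per nanosecond, as listed above.
peak = {"A55": 115, "A76": 307, "GPU": 1024, "NPU": 3072}

cpu_total = peak["A55"] + peak["A76"]   # all eight CPU cores together
gpu_vs_a76 = peak["GPU"] / peak["A76"]  # GPU against the four big cores
gpu_vs_cpu = peak["GPU"] / cpu_total    # GPU against every CPU core
npu_vs_gpu = peak["NPU"] / peak["GPU"]

print(f"CPU total: {cpu_total}")
print(f"GPU / A76: {gpu_vs_a76:.1f}x, GPU / all CPU: {gpu_vs_cpu:.1f}x, NPU / GPU: {npu_vs_gpu:.1f}x")
```

On paper the GPU has a bit over three times the 8-bit throughput of the A76 cluster, whereas the measured GpuAcc results sit slightly behind CpuAcc.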

@icecream95 are the CPU calcs for the plain CPU and not NEON? We can test that also.
GpuAcc = GPU, CpuAcc = really NEON, CpuRef = CPU

I slept, had a look at the code and thought it would be easy.
It processes all the audio into an array first, so you may think it has frozen, and then just runs the model.

python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends CpuAcc CpuRef

CPU seems to take approx 5.517623 s at approx 45% CPU load
GPU seems to take approx 6.291094 s at approx 5% CPU load and 75% GPU load

That is with a hacky Python script feeding it, but now with very little in the inference loop.
There is some overhead from ArmNN and OpenCL, but this is exactly what I wanted to check.
The RK3588 really is a powerhouse for ML, as its Mali MP4 is almost a perfect match for the CPU.
So we can run 2x models that are approx 2x the load of the current one, and it would be interesting to see how the Mali also reacts to bigger models.

I had a go with TensorFlow TTS, as it’s a much heavier load, but working out how to slot in ArmNN and the quantisation specifics is going to be a bigger dev chore than what the above can demo.

# Copyright © 2021 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT

"""Automatic speech recognition with PyArmNN demo for processing audio clips to text."""

import sys
import os
import numpy as np
import psutil
import soundfile as sf
script_dir = os.path.dirname(__file__)
sys.path.insert(1, os.path.join(script_dir, '..', 'common'))

from argparse import ArgumentParser
from network_executor import ArmnnNetworkExecutor
from utils import prepare_input_data
from audio_capture import AudioCaptureParams, capture_audio
from audio_utils import decode_text, display_text
from wav2letter_mfcc import Wav2LetterMFCC, W2LAudioPreprocessor
from mfcc import MFCCParams
from datetime import datetime, timedelta

# Model Specific Labels
labels = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm',
          13: 'n',
          14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',
          25: 'z',
          26: "'", 27: ' ', 28: '$'}

def time_float(result):
    seconds = int(result)
    microseconds = int((result * 1000000) % 1000000)
    output = timedelta(0, seconds, microseconds)
    return output

def parse_args():
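    parser = ArgumentParser(description="ASR with PyArmNN")
    parser.add_argument("--audio_file_path", required=True, type=str,
                        help="Path to the audio file to perform ASR")
    parser.add_argument("--model_file_path", required=True, type=str,
                        help="Path to ASR model to use")
    parser.add_argument("--preferred_backends", type=str, nargs="+",
                        default=["GpuAcc", "CpuAcc", "CpuRef"],
                        help="""List of backends in order of preference for optimizing
        subgraphs, falling back to the next backend in the list on unsupported
        layers. Defaults to [GpuAcc, CpuAcc, CpuRef]""")
    return parser.parse_args()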

def main(args, network, input_data):

    current_r_context = ""
    is_first_window = True
    avg_cpu = 0.0
    for input_chunk in input_data:
        # Run inference
        output_result = network.run([input_chunk])

        # Slice and Decode the text, and store the right context
        current_r_context, text = decode_text(is_first_window, labels, output_result)

        is_first_window = False

        runtime = datetime.now() - starttime
        print(" " + str(runtime))
        avg_cpu = avg_cpu + psutil.cpu_percent()

    print(current_r_context, flush=True)
    print("Inference End: Avg CPU%=" + str(avg_cpu / len(input_data)))
    return runtime

if __name__ == "__main__":
    args = parse_args()
    # Create the ArmNN inference runner
    network = ArmnnNetworkExecutor(args.model_file_path, args.preferred_backends)
    # Read command line args
    audio_file = args.audio_file_path
    sf_data, samplerate = sf.read(audio_file)
    sf_secs = time_float((len(sf_data) / samplerate))
    # Specify model specific audio data requirements
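    # mono=True is an assumption (the line was truncated); it matches the upstream ArmNN sample.
    audio_capture_params = AudioCaptureParams(dtype=np.float32, overlap=31712, min_samples=47712, sampling_freq=16000,
                                              mono=True)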

    buffer = capture_audio(audio_file, audio_capture_params)
    # Extract features and create the preprocessor

    mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128, mel_lo_freq=0, mel_hi_freq=8000,
                             num_mfcc_feats=13, frame_len=512, use_htk_method=False, n_fft=512)

    wmfcc = Wav2LetterMFCC(mfcc_params)
    preprocessor = W2LAudioPreprocessor(wmfcc, model_input_size=296, stride=160)   
    print("Processing Audio Frames...")
    input_data = []

    for audio_data in buffer:
        # Prepare the input Tensors
        input_data.append(prepare_input_data(audio_data, network.get_data_type(), network.get_input_quantization_scale(0),
                                        network.get_input_quantization_offset(0), preprocessor))
    starttime = datetime.now()
    runtime = main(args, network, input_data)
    print("Runtime=" + str(runtime))
    print("Realtime=x" + str(sf_secs / runtime))
    starttime = datetime.now()
    runtime = main(args, network, input_data)
    print("Runtime=" + str(runtime))
    print("Realtime=x" + str(sf_secs / runtime))
rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends CpuAcc CpuRef
Your ArmNN library instance does not support Onnx models parser functionality.  Skipped IOnnxParser import.
Can't load libOpenCL.so: libOpenCL.so: cannot open shared object file: No such file or directory
Can't load libGLES_mali.so: libGLES_mali.so: cannot open shared object file: No such file or directory
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Preferred backends: ['CpuAcc', 'CpuRef']
IDeviceSpec { supportedBackends: [CpuAcc, CpuRef, GpuAcc]}
Optimization warnings: ()
Processing Audio Frames...
 henry f 0:00:00.039640
 phillips 0:00:00.081647
 from wic 0:00:00.111952
epedeia 0:00:00.134524
 the free ind cy 0:00:00.153441
ycliopedioa 0:00:00.172588
 at e 0:00:00.191404
 and d 0:00:00.210939
ot we cud 0:00:00.230029
pedia tha 0:00:00.248594
t org 0:00:00.266827
 henry 0:00:00.305864
 f philip 0:00:00.324896
s 0:00:00.344135
 from  0:00:00.363412
 wickepedia  0:00:00.381937
 the freie andcyclo 0:00:00.401530
opedia 0:00:00.420985
 henry afh 0:00:00.458467
e philips 0:00:00.477728
 eighteen ni 0:00:00.495774
ney to nin 0:00:00.513933
neteen fifty e 0:00:00.532492
ight 0:00:00.551160
 a youas  0:00:00.570730
 businessman 0:00:00.589955
 from portland 0:00:00.608142
d or again 0:00:00.627386
 a 0:00:00.645696
s the honor 0:00:00.664017
 of having the phi 0:00:00.683096
lip's head scre 0:00:00.702685
w and sc 0:00:00.730444
crew driver  0:00:00.752651
 named ater hi 0:00:00.774348
m 0:00:00.793570
 the importance  0:00:00.831017
 of the cross had i 0:00:00.849545
 screw design 0:00:00.868613
 lies 0:00:00.887060
 in its t self ce 0:00:00.908571
entering property 0:00:00.927869
y u 0:00:00.946793
useful on ad 0:00:00.964844
imated production 0:00:00.984788
n lines t 0:00:01.003689
he use powered 0:00:01.022886
d screw drivers 0:00:01.044580
 philip 0:00:01.085195
's major contrib 0:00:01.105548
bution was i 0:00:01.124491
n driving the cro 0:00:01.142652
oss head concept 0:00:01.163555
 forward 0:00:01.182046
 to the point 0:00:01.220041
 where it was a 0:00:01.256361
dopted by scre 0:00:01.278730
ew makers an 0:00:01.298773
nd automabiele c 0:00:01.317458
companiese 0:00:01.336092
 although he recei 0:00:01.373344
ved patents for the 0:00:01.391744
 design in 0:00:01.410014
 nineten thirty  0:00:01.428450
six 0:00:01.449285
 you as pate 0:00:01.467786
nt number  0:00:01.486576
 to million 0:00:01.505696
 forty six 0:00:01.535174
x thousand 0:00:01.563460
 three hundred f 0:00:01.586442
forty three 0:00:01.606334
 you es 0:00:01.625209
 patents t 0:00:01.643606
oo million 0:00:01.661801
 forty six thou 0:00:01.681349
usand aig 0:00:01.701606
he hundred thirty se 0:00:01.734885
ven to t 0:00:01.758276
two million  0:00:01.778433
 forty six thou 0:00:01.797458
sand eight  0:00:01.815965
hundered forty 0:00:01.835066
 it was so wi 0:00:01.871975
dely copied 0:00:01.890931
 that by nin 0:00:01.909609
neteen forty ni 0:00:01.928052
ne phili 0:00:01.946818
ips lost his p 0:00:01.965393
partent 0:00:01.983982
 the american  0:00:02.024593
 screw company 0:00:02.045068
 was respon 0:00:02.076243
nsible for devis 0:00:02.105305
sing a means  0:00:02.126591
 of manufactu 0:00:02.145923
ring the screw 0:00:02.164145
 and  0:00:02.182812
successfully pa 0:00:02.204005
tented and  0:00:02.222751
licence that the are 0:00:02.241392
 method 0:00:02.260188
 o 0:00:02.278339
others screw makers 0:00:02.298520
 of the nineteen h th 0:00:02.316845
hirties dis 0:00:02.335092
missed the phillip's c 0:00:02.353498
concept  0:00:02.371793
since it calls for  0:00:02.389805
 relatively co 0:00:02.411886
mplex r 0:00:02.430947
resisets ocke 0:00:02.450992
et sheepe i 0:00:02.470813
in the head of the scr 0:00:02.489585
rew 0:00:02.507690
 as disti 0:00:02.526146
inct from the simple 0:00:02.544312
e milled slaught 0:00:02.562836
 of a slaugh 0:00:02.583057
hted type scre 0:00:02.612682
w 0:00:02.642730
 the philip' 0:00:02.666383
s  screw compa 0:00:02.686970
any and  0:00:02.705801
 the american sc 0:00:02.723954
rew company 0:00:02.743269
 went on  0:00:02.761700
 to devise  0:00:02.779779
the posi drive 0:00:02.799883
e screw 0:00:02.818531
 which  0:00:02.837064
 differs from the ph 0:00:02.855034
hilips  0:00:02.873806
 in that it is de 0:00:02.897096
signed to accomo 0:00:02.917923
dae greaterd t 0:00:02.938485
ork than the phi 0:00:02.957239
lips 0:00:02.976965
 and image a  0:00:03.014475
companied this arti 0:00:03.033712
icol  0:00:03.071588
 caption 0:00:03.109826
 philips  0:00:03.139520
 screw head 0:00:03.161752
 the followin 0:00:03.218120
g is an infu bo 0:00:03.236750
x which ac 0:00:03.256835
ccompanies this ar 0:00:03.276173
rticle 0:00:03.294907
 in  0:00:03.313449
fu box 0:00:03.332069
 part o thes  0:00:03.350309
 series un 0:00:03.369606
screw drive 0:00:03.388178
 types 0:00:03.406281
 slaughted 0:00:03.444362
 commonl 0:00:03.462749
y eroneou 0:00:03.481221
usly fla 0:00:03.500575
athead 0:00:03.519235
 phylips 0:00:03.560982
 cross he 0:00:03.583342
ad 0:00:03.603654
 pasierive 0:00:03.643562
 super  0:00:03.662293
drive 0:00:03.680999
 tokgs 0:00:03.718740
 ha 0:00:03.756135
cx a 0:00:03.776041
len 0:00:03.794814
 roberts son 0:00:03.836249
try wing 0:00:03.893283
 tark 0:00:03.936371
 set 0:00:03.954823
 span er head 0:00:03.991776
 triple square 0:00:04.051670
e ex 0:00:04.070651
sy nd 0:00:04.093896
 ot 0:00:04.137063
hers 0:00:04.155997
 polly drives 0:00:04.174308
 sp 0:00:04.193733
linmde drive 0:00:04.212975
 double 0:00:04.231921
e hacks 0:00:04.250232
 many images ac 0:00:04.287337
ccompanyed this in 0:00:04.305945
pu box 0:00:04.326572
 this 0:00:04.345428
 page was last 0:00:04.363623
t modified 0:00:04.382075
 on the ninth of va 0:00:04.400191
april two  0:00:04.418365
 thousand aeight 0:00:04.437033
 at s 0:00:04.455600
seventeen o 0:00:04.474167
 for 0:00:04.492340
 all te 0:00:04.530033
xt as avaivlable 0:00:04.549138
 under the term 0:00:04.595112
ms of thei ganew 0:00:04.630920
 free document 0:00:04.650296
tation licens 0:00:04.668415
 sea 0:00:04.686839
 copyrites  0:00:04.705349
 for details 0:00:04.725088
 wichpedia 0:00:04.766718
 is aregister 0:00:04.786224
n trade mark  0:00:04.804714
 of the wikie mmede 0:00:04.822999
ea foundation 0:00:04.841733
 incorporated 0:00:04.859885
 a  0:00:04.878413
 eu as registrud 0:00:04.897591
d fival 0:00:04.921186
 one sea  0:00:04.942560
 three  0:00:04.961235
 tax theductable 0:00:04.979391
 non profhet c 0:00:04.997482
harity 0:00:05.016022
this sound fi 0:00:05.054727
le and all 0:00:05.076378
 text in the artic 0:00:05.097072
cle or li 0:00:05.115830
cense under  0:00:05.134143
 the thenew fre 0:00:05.152953
e documentation 0:00:05.171724
n license 0:00:05.190068
 availabl 0:00:05.212038
le at  0:00:05.231690
oubl you doubleyou  0:00:05.250553
 dw do 0:00:05.270274
t g 0:00:05.290077
 and you 0:00:05.308365
u dot 0:00:05.327549
 horg 0:00:05.345873
 slash  0:00:05.364095
 cope left 0:00:05.382992
 slash 0:00:05.401753
 f dee 0:00:05.420065
d el 0:00:05.439199
 dout each t 0:00:05.457707
ea m l 0:00:05.476005

Inference End: Avg CPU% 44.152573529411804
 henry f 0:00:00.029676
 phillips 0:00:00.054872
 from wic 0:00:00.076020
epedeia 0:00:00.094366
 the free ind cy 0:00:00.114368
ycliopedioa 0:00:00.133422
 at e 0:00:00.152289
 and d 0:00:00.183105
ot we cud 0:00:00.207808
pedia tha 0:00:00.229333
t org 0:00:00.250968
 henry 0:00:00.288826
 f philip 0:00:00.308294
s 0:00:00.326518
 from  0:00:00.347068
 wickepedia  0:00:00.371150
 the freie andcyclo 0:00:00.393486
opedia 0:00:00.414678
 henry afh 0:00:00.452419
e philips 0:00:00.479387
 eighteen ni 0:00:00.505646
ney to nin 0:00:00.528157
neteen fifty e 0:00:00.549082
ight 0:00:00.568212
 a youas  0:00:00.587044
 businessman 0:00:00.610898
 from portland 0:00:00.632699
d or again 0:00:00.652812
 a 0:00:00.672068
s the honor 0:00:00.692313
 of having the phi 0:00:00.711468
lip's head scre 0:00:00.730892
w and sc 0:00:00.749864
crew driver  0:00:00.768192
 named ater hi 0:00:00.786391
m 0:00:00.804390
 the importance  0:00:00.840970
 of the cross had i 0:00:00.860126
 screw design 0:00:00.879214
 lies 0:00:00.897920
 in its t self ce 0:00:00.917412
entering property 0:00:00.935788
y u 0:00:00.954098
useful on ad 0:00:00.973072
imated production 0:00:00.992050
n lines t 0:00:01.010948
he use powered 0:00:01.030059
d screw drivers 0:00:01.048675
 philip 0:00:01.085260
's major contrib 0:00:01.104387
bution was i 0:00:01.122887
n driving the cro 0:00:01.141004
oss head concept 0:00:01.159672
 forward 0:00:01.177971
 to the point 0:00:01.196020
 where it was a 0:00:01.214700
dopted by scre 0:00:01.233585
ew makers an 0:00:01.254226
nd automabiele c 0:00:01.273182
companiese 0:00:01.291665
 although he recei 0:00:01.328732
ved patents for the 0:00:01.347082
 design in 0:00:01.365269
 nineten thirty  0:00:01.384728
six 0:00:01.403558
 you as pate 0:00:01.422254
nt number  0:00:01.440639
 to million 0:00:01.458886
 forty six 0:00:01.477751
x thousand 0:00:01.496483
 three hundred f 0:00:01.515026
forty three 0:00:01.533806
 you es 0:00:01.552109
 patents t 0:00:01.570447
oo million 0:00:01.589292
 forty six thou 0:00:01.608702
usand aig 0:00:01.628919
he hundred thirty se 0:00:01.647763
ven to t 0:00:01.666474
two million  0:00:01.685872
 forty six thou 0:00:01.718653
sand eight  0:00:01.746268
hundered forty 0:00:01.765286
 it was so wi 0:00:01.803680
dely copied 0:00:01.827649
 that by nin 0:00:01.848720
neteen forty ni 0:00:01.867413
ne phili 0:00:01.888695
ips lost his p 0:00:01.907449
partent 0:00:01.926010
 the american  0:00:01.963833
 screw company 0:00:01.982087
 was respon 0:00:02.003408
nsible for devis 0:00:02.023105
sing a means  0:00:02.042951
 of manufactu 0:00:02.062428
ring the screw 0:00:02.083601
 and  0:00:02.104128
successfully pa 0:00:02.123177
tented and  0:00:02.141869
licence that the are 0:00:02.160193
 method 0:00:02.181144
 o 0:00:02.200624
others screw makers 0:00:02.219218
 of the nineteen h th 0:00:02.237288
hirties dis 0:00:02.255883
missed the phillip's c 0:00:02.274358
concept  0:00:02.294018
since it calls for  0:00:02.312455
 relatively co 0:00:02.330471
mplex r 0:00:02.349534
resisets ocke 0:00:02.367540
et sheepe i 0:00:02.387564
in the head of the scr 0:00:02.405687
rew 0:00:02.423716
 as disti 0:00:02.442485
inct from the simple 0:00:02.461059
e milled slaught 0:00:02.482157
 of a slaugh 0:00:02.500962
hted type scre 0:00:02.521853
w 0:00:02.542050
 the philip' 0:00:02.561338
s  screw compa 0:00:02.580732
any and  0:00:02.607533
 the american sc 0:00:02.630970
rew company 0:00:02.649956
 went on  0:00:02.668940
 to devise  0:00:02.687399
the posi drive 0:00:02.706255
e screw 0:00:02.737116
 which  0:00:02.761253
 differs from the ph 0:00:02.783470
hilips  0:00:02.801990
 in that it is de 0:00:02.820875
signed to accomo 0:00:02.841162
dae greaterd t 0:00:02.860114
ork than the phi 0:00:02.878703
lips 0:00:02.897341
 and image a  0:00:02.934287
companied this arti 0:00:02.952481
icol  0:00:02.971039
 caption 0:00:02.990690
 philips  0:00:03.010994
 screw head 0:00:03.029708
 the followin 0:00:03.086644
g is an infu bo 0:00:03.105915
x which ac 0:00:03.124921
ccompanies this ar 0:00:03.143613
rticle 0:00:03.162389
 in  0:00:03.180519
fu box 0:00:03.199188
 part o thes  0:00:03.217988
 series un 0:00:03.236804
screw drive 0:00:03.255463
 types 0:00:03.273704
 slaughted 0:00:03.310921
 commonl 0:00:03.329144
y eroneou 0:00:03.347206
usly fla 0:00:03.366471
athead 0:00:03.386305
 phylips 0:00:03.423253
 cross he 0:00:03.445383
ad 0:00:03.466505
 pasierive 0:00:03.504122
 super  0:00:03.522533
drive 0:00:03.541154
 tokgs 0:00:03.578647
 ha 0:00:03.618935
cx a 0:00:03.637462
len 0:00:03.656392
 roberts son 0:00:03.694608
try wing 0:00:03.761281
 tark 0:00:03.802532
 set 0:00:03.821357
 span er head 0:00:03.859893
 triple square 0:00:03.915424
e ex 0:00:03.934438
sy nd 0:00:03.952991
 ot 0:00:03.991752
hers 0:00:04.011104
 polly drives 0:00:04.030115
 sp 0:00:04.048891
linmde drive 0:00:04.067864
 double 0:00:04.086988
e hacks 0:00:04.106561
 many images ac 0:00:04.143887
ccompanyed this in 0:00:04.162591
pu box 0:00:04.181225
 this 0:00:04.200282
 page was last 0:00:04.219203
t modified 0:00:04.237694
 on the ninth of va 0:00:04.258145
april two  0:00:04.276569
 thousand aeight 0:00:04.294845
 at s 0:00:04.313113
seventeen o 0:00:04.332516
 for 0:00:04.380070
 all te 0:00:04.438166
xt as avaivlable 0:00:04.457145
 under the term 0:00:04.475252
ms of thei ganew 0:00:04.494399
 free document 0:00:04.512919
tation licens 0:00:04.531204
 sea 0:00:04.552007
 copyrites  0:00:04.570451
 for details 0:00:04.588813
 wichpedia 0:00:04.627385
 is aregister 0:00:04.646055
n trade mark  0:00:04.682055
 of the wikie mmede 0:00:04.709029
ea foundation 0:00:04.730148
 incorporated 0:00:04.749153
 a  0:00:04.767656
 eu as registrud 0:00:04.786193
d fival 0:00:04.806515
 one sea  0:00:04.839423
 three  0:00:04.877059
 tax theductable 0:00:04.903783
 non profhet c 0:00:04.928306
harity 0:00:04.951226
this sound fi 0:00:04.990772
le and all 0:00:05.009120
 text in the artic 0:00:05.027202
cle or li 0:00:05.046101
cense under  0:00:05.065126
 the thenew fre 0:00:05.084226
e documentation 0:00:05.102388
n license 0:00:05.122325
 availabl 0:00:05.146357
le at  0:00:05.168419
oubl you doubleyou  0:00:05.188388
 dw do 0:00:05.206907
t g 0:00:05.225644
 and you 0:00:05.244077
u dot 0:00:05.262366
 horg 0:00:05.280884
 slash  0:00:05.326040
 cope left 0:00:05.370482
 slash 0:00:05.397159
 f dee 0:00:05.419872
d el 0:00:05.441224
 dout each t 0:00:05.460983
ea m l 0:00:05.479195

Inference End: Avg CPU% 44.149632352941204

The only overhead is some sort of load or processing with ArmNN & the GPU, as the 2nd run always seems to be faster. For many models that doesn’t actually matter, as in code you could preload with a trial run and just hold the model in memory.
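
A warm-up pass of that kind could look something like the following. `infer` is a placeholder for whatever call runs the model (for example `network.run` in the script above), and the helper name is made up.

```python
import time


def timed_after_warmup(infer, inputs, warmup_runs=1):
    """Run `infer` over `inputs` untimed a few times, then time one more pass."""
    for _ in range(warmup_runs):
        for item in inputs:
            infer(item)  # result discarded; this pass just pays the one-off costs
    start = time.perf_counter()
    outputs = [infer(item) for item in inputs]
    return outputs, time.perf_counter() - start


# Toy stand-in so the sketch runs without ArmNN installed.
calls = []
outputs, elapsed = timed_after_warmup(lambda x: calls.append(x) or x * 2, [1, 2, 3])
print(outputs, len(calls))
```

The timed pass then reflects steady-state behaviour, with the first-run cost paid during the warm-up.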

rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends GpuAcc CpuAcc CpuRef
Your ArmNN library instance does not support Onnx models parser functionality.  Skipped IOnnxParser import.
Can't load libOpenCL.so: libOpenCL.so: cannot open shared object file: No such file or directory
Can't load libGLES_mali.so: libGLES_mali.so: cannot open shared object file: No such file or directory
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Preferred backends: ['GpuAcc', 'CpuAcc', 'CpuRef']
IDeviceSpec { supportedBackends: [CpuAcc, CpuRef, GpuAcc]}
Optimization warnings: ()
Processing Audio Frames...
 henry f 0:00:00.199833
 phillips 0:00:00.263243
 from wic 0:00:00.328341
epedeia 0:00:00.387643
 the free ind cy 0:00:00.446891
ycliopedioa 0:00:00.506139
 at e 0:00:00.568754
 and d 0:00:00.633177
ot we cud 0:00:00.696274
pedia tha 0:00:00.760866
t org 0:00:00.825377
 henry 0:00:00.951952
 f philip 0:00:01.014317
s 0:00:01.078718
 from  0:00:01.142757
 wickepedia  0:00:01.206237
 the freie andcyclo 0:00:01.272636
opedia 0:00:01.335396
 henry afh 0:00:01.464624
e philips 0:00:01.530219
 eighteen ni 0:00:01.594969
ney to nin 0:00:01.658931
neteen fifty e 0:00:01.725520
ight 0:00:01.790579
 a youas  0:00:01.855143
 businessman 0:00:01.921110
 from portland 0:00:01.984640
d or again 0:00:02.046878
 a 0:00:02.108508
s the honor 0:00:02.173074
 of having the phi 0:00:02.235354
lip's head scre 0:00:02.295376
w and sc 0:00:02.356306
crew driver  0:00:02.419730
 named ater hi 0:00:02.479308
m 0:00:02.541863
 the importance  0:00:02.665169
 of the cross had i 0:00:02.729038
 screw design 0:00:02.792687
 lies 0:00:02.854305
 in its t self ce 0:00:02.916365
entering property 0:00:02.981664
y u 0:00:03.045516
useful on ad 0:00:03.107087
imated production 0:00:03.169201
n lines t 0:00:03.231844
he use powered 0:00:03.294249
d screw drivers 0:00:03.357977
 philip 0:00:03.444255
's major contrib 0:00:03.470326
bution was i 0:00:03.493646
n driving the cro 0:00:03.516106
oss head concept 0:00:03.540040
 forward 0:00:03.563859
 to the point 0:00:03.587168
 where it was a 0:00:03.614574
dopted by scre 0:00:03.639191
ew makers an 0:00:03.662893
nd automabiele c 0:00:03.686058
companiese 0:00:03.709900
 although he recei 0:00:03.756842
ved patents for the 0:00:03.779858
 design in 0:00:03.803692
 nineten thirty  0:00:03.826854
six 0:00:03.849560
 you as pate 0:00:03.873317
nt number  0:00:03.896252
 to million 0:00:03.919767
 forty six 0:00:03.943308
x thousand 0:00:03.966859
 three hundred f 0:00:03.989896
forty three 0:00:04.013471
 you es 0:00:04.036481
 patents t 0:00:04.058903
oo million 0:00:04.082610
 forty six thou 0:00:04.105423
usand aig 0:00:04.129519
he hundred thirty se 0:00:04.151631
ven to t 0:00:04.175407
two million  0:00:04.198211
 forty six thou 0:00:04.219682
sand eight  0:00:04.241892
hundered forty 0:00:04.265862
 it was so wi 0:00:04.313080
dely copied 0:00:04.334637
 that by nin 0:00:04.357252
neteen forty ni 0:00:04.380913
ne phili 0:00:04.405236
ips lost his p 0:00:04.427298
partent 0:00:04.449701
 the american  0:00:04.495946
 screw company 0:00:04.519409
 was respon 0:00:04.541802
nsible for devis 0:00:04.565851
sing a means  0:00:04.588770
 of manufactu 0:00:04.610094
ring the screw 0:00:04.632057
 and  0:00:04.654730
successfully pa 0:00:04.678612
tented and  0:00:04.701875
licence that the are 0:00:04.724467
 method 0:00:04.746392
 o 0:00:04.772910
others screw makers 0:00:04.797385
 of the nineteen h th 0:00:04.819947
hirties dis 0:00:04.842332
missed the phillip's c 0:00:04.865902
concept  0:00:04.890104
since it calls for  0:00:04.913179
 relatively co 0:00:04.934973
mplex r 0:00:04.957153
resisets ocke 0:00:04.979518
et sheepe i 0:00:05.003380
in the head of the scr 0:00:05.027389
rew 0:00:05.051538
 as disti 0:00:05.074455
inct from the simple 0:00:05.097022
e milled slaught 0:00:05.120876
 of a slaugh 0:00:05.143777
hted type scre 0:00:05.166226
w 0:00:05.187945
 the philip' 0:00:05.210199
s  screw compa 0:00:05.234057
any and  0:00:05.257507
 the american sc 0:00:05.279911
rew company 0:00:05.303471
 went on  0:00:05.326708
 to devise  0:00:05.350512
the posi drive 0:00:05.373730
e screw 0:00:05.397346
 which  0:00:05.420818
 differs from the ph 0:00:05.442951
hilips  0:00:05.466475
 in that it is de 0:00:05.489528
signed to accomo 0:00:05.512900
dae greaterd t 0:00:05.535968
ork than the phi 0:00:05.558366
lips 0:00:05.580838
 and image a  0:00:05.628712
companied this arti 0:00:05.651715
icol  0:00:05.673942
 caption 0:00:05.697683
 philips  0:00:05.722421
 screw head 0:00:05.744466
 the followin 0:00:05.812560
g is an infu bo 0:00:05.835601
x which ac 0:00:05.856830
ccompanies this ar 0:00:05.879666
rticle 0:00:05.905828
 in  0:00:05.928671
fu box 0:00:05.951345
 part o thes  0:00:05.975262
 series un 0:00:05.999708
screw drive 0:00:06.022731
 types 0:00:06.045104
 slaughted 0:00:06.091912
 commonl 0:00:06.113519
y eroneou 0:00:06.136937
usly fla 0:00:06.159848
athead 0:00:06.183362
 phylips 0:00:06.228803
 cross he 0:00:06.251184
ad 0:00:06.275104
 pasierive 0:00:06.322722
 super  0:00:06.345218
drive 0:00:06.368872
 tokgs 0:00:06.416137
 ha 0:00:06.460092
cx a 0:00:06.483427
len 0:00:06.506352
 roberts son 0:00:06.550262
try wing 0:00:06.619407
 tark 0:00:06.663290
 set 0:00:06.689627
 span er head 0:00:06.736059
 triple square 0:00:06.803074
e ex 0:00:06.826732
sy nd 0:00:06.848523
 ot 0:00:06.893005
hers 0:00:06.916677
 polly drives 0:00:06.939797
 sp 0:00:06.963346
linmde drive 0:00:06.986584
 double 0:00:07.010035
e hacks 0:00:07.034139
 many images ac 0:00:07.079361
ccompanyed this in 0:00:07.100901
pu box 0:00:07.126865
 this 0:00:07.150326
 page was last 0:00:07.172780
t modified 0:00:07.196601
 on the ninth of va 0:00:07.220068
april two  0:00:07.242628
 thousand aeight 0:00:07.264406
 at s 0:00:07.286862
seventeen o 0:00:07.313602
 for 0:00:07.337783
 all te 0:00:07.386509
xt as avaivlable 0:00:07.409978
 under the term 0:00:07.432496
ms of thei ganew 0:00:07.456288
 free document 0:00:07.479422
tation licens 0:00:07.501876
 sea 0:00:07.523476
 copyrites  0:00:07.545940
 for details 0:00:07.569659
 wichpedia 0:00:07.616525
 is aregister 0:00:07.639508
n trade mark  0:00:07.663383
 of the wikie mmede 0:00:07.686577
ea foundation 0:00:07.710205
 incorporated 0:00:07.734723
 a  0:00:07.757550
 eu as registrud 0:00:07.779905
d fival 0:00:07.802303
 one sea  0:00:07.823395
 three  0:00:07.845993
 tax theductable 0:00:07.869594
 non profhet c 0:00:07.892780
harity 0:00:07.916367
this sound fi 0:00:07.963100
le and all 0:00:07.986285
 text in the artic 0:00:08.009850
cle or li 0:00:08.034208
cense under  0:00:08.057152
 the thenew fre 0:00:08.079343
e documentation 0:00:08.103166
n license 0:00:08.126288
 availabl 0:00:08.149965
le at  0:00:08.173247
oubl you doubleyou  0:00:08.196931
 dw do 0:00:08.220100
t g 0:00:08.242015
 and you 0:00:08.264490
u dot 0:00:08.288011
 horg 0:00:08.310951
 slash  0:00:08.333130
 cope left 0:00:08.356892
 slash 0:00:08.379863
 f dee 0:00:08.403358
d el 0:00:08.426098
 dout each t 0:00:08.449922
ea m l 0:00:08.472750

Inference End: Avg CPU% 6.319485294117635
 henry f 0:00:00.025098
 phillips 0:00:00.048059
 from wic 0:00:00.069682
epedeia 0:00:00.092882
 the free ind cy 0:00:00.115887
ycliopedioa 0:00:00.139495
 at e 0:00:00.162664
 and d 0:00:00.186114
ot we cud 0:00:00.209049
pedia tha 0:00:00.231531
t org 0:00:00.253739
 henry 0:00:00.299566
 f philip 0:00:00.322653
s 0:00:00.346243
 from  0:00:00.368908
 wickepedia  0:00:00.390658
 the freie andcyclo 0:00:00.413116
opedia 0:00:00.436799
 henry afh 0:00:00.482093
e philips 0:00:00.503667
 eighteen ni 0:00:00.525942
ney to nin 0:00:00.549909
neteen fifty e 0:00:00.573129
ight 0:00:00.595368
 a youas  0:00:00.619186
 businessman 0:00:00.642221
 from portland 0:00:00.666204
d or again 0:00:00.688944
 a 0:00:00.712865
s the honor 0:00:00.735683
 of having the phi 0:00:00.759400
lip's head scre 0:00:00.782900
w and sc 0:00:00.805186
crew driver  0:00:00.828966
 named ater hi 0:00:00.851664
m 0:00:00.875356
 the importance  0:00:00.921846
 of the cross had i 0:00:00.944766
 screw design 0:00:00.966324
 lies 0:00:00.989620
 in its t self ce 0:00:01.012966
entering property 0:00:01.036775
y u 0:00:01.059720
useful on ad 0:00:01.082092
imated production 0:00:01.103954
n lines t 0:00:01.126182
he use powered 0:00:01.149641
d screw drivers 0:00:01.172723
 philip 0:00:01.219340
's major contrib 0:00:01.242948
bution was i 0:00:01.265901
n driving the cro 0:00:01.289600
oss head concept 0:00:01.312517
 forward 0:00:01.336215
 to the point 0:00:01.359559
 where it was a 0:00:01.383004
dopted by scre 0:00:01.405795
ew makers an 0:00:01.429504
nd automabiele c 0:00:01.453834
companiese 0:00:01.478208
 although he recei 0:00:01.523546
ved patents for the 0:00:01.545980
 design in 0:00:01.570002
 nineten thirty  0:00:01.593172
six 0:00:01.616883
 you as pate 0:00:01.639729
nt number  0:00:01.662191
 to million 0:00:01.685943
 forty six 0:00:01.708600
x thousand 0:00:01.732211
 three hundred f 0:00:01.756375
forty three 0:00:01.779461
 you es 0:00:01.803172
 patents t 0:00:01.826351
oo million 0:00:01.848640
 forty six thou 0:00:01.872611
usand aig 0:00:01.895845
he hundred thirty se 0:00:01.919687
ven to t 0:00:01.943920
two million  0:00:01.967219
 forty six thou 0:00:01.991258
sand eight  0:00:02.013351
hundered forty 0:00:02.035500
 it was so wi 0:00:02.082262
dely copied 0:00:02.106268
 that by nin 0:00:02.129780
neteen forty ni 0:00:02.153251
ne phili 0:00:02.176276
ips lost his p 0:00:02.198535
partent 0:00:02.222286
 the american  0:00:02.269256
 screw company 0:00:02.292370
 was respon 0:00:02.316087
nsible for devis 0:00:02.339033
sing a means  0:00:02.362709
 of manufactu 0:00:02.386082
ring the screw 0:00:02.407517
 and  0:00:02.429844
successfully pa 0:00:02.453563
tented and  0:00:02.476469
licence that the are 0:00:02.500109
 method 0:00:02.523004
 o 0:00:02.545306
others screw makers 0:00:02.569281
 of the nineteen h th 0:00:02.592558
hirties dis 0:00:02.614155
missed the phillip's c 0:00:02.635826
concept  0:00:02.658427
since it calls for  0:00:02.682316
 relatively co 0:00:02.705207
mplex r 0:00:02.727857
resisets ocke 0:00:02.751483
et sheepe i 0:00:02.774697
in the head of the scr 0:00:02.796381
rew 0:00:02.819614
 as disti 0:00:02.842916
inct from the simple 0:00:02.864951
e milled slaught 0:00:02.887507
 of a slaugh 0:00:02.911580
hted type scre 0:00:02.934264
w 0:00:02.956632
 the philip' 0:00:02.980450
s  screw compa 0:00:03.003790
any and  0:00:03.026059
 the american sc 0:00:03.049843
rew company 0:00:03.072993
 went on  0:00:03.095393
 to devise  0:00:03.119325
the posi drive 0:00:03.142464
e screw 0:00:03.165982
 which  0:00:03.189129
 differs from the ph 0:00:03.212906
hilips  0:00:03.235966
 in that it is de 0:00:03.259806
signed to accomo 0:00:03.282958
dae greaterd t 0:00:03.306649
ork than the phi 0:00:03.329702
lips 0:00:03.351889
 and image a  0:00:03.399796
companied this arti 0:00:03.422819
icol  0:00:03.446313
 caption 0:00:03.469297
 philips  0:00:03.491633
 screw head 0:00:03.514018
 the followin 0:00:03.582819
g is an infu bo 0:00:03.606589
x which ac 0:00:03.629573
ccompanies this ar 0:00:03.651937
rticle 0:00:03.675519
 in  0:00:03.698401
fu box 0:00:03.722268
 part o thes  0:00:03.745215
 series un 0:00:03.768986
screw drive 0:00:03.793040
 types 0:00:03.815889
 slaughted 0:00:03.859728
 commonl 0:00:03.882820
y eroneou 0:00:03.906015
usly fla 0:00:03.929665
athead 0:00:03.953453
 phylips 0:00:03.998611
 cross he 0:00:04.022581
ad 0:00:04.045314
 pasierive 0:00:04.092172
 super  0:00:04.115807
drive 0:00:04.139027
 tokgs 0:00:04.185378
 ha 0:00:04.231554
cx a 0:00:04.255428
len 0:00:04.278669
 roberts son 0:00:04.325458
try wing 0:00:04.395594
 tark 0:00:04.446092
 set 0:00:04.470235
 span er head 0:00:04.516097
 triple square 0:00:04.585351
e ex 0:00:04.608908
sy nd 0:00:04.631941
 ot 0:00:04.678754
hers 0:00:04.702643
 polly drives 0:00:04.725733
 sp 0:00:04.749337
linmde drive 0:00:04.772469
 double 0:00:04.796209
e hacks 0:00:04.818715
 many images ac 0:00:04.865688
ccompanyed this in 0:00:04.889308
pu box 0:00:04.912474
 this 0:00:04.934762
 page was last 0:00:04.958719
t modified 0:00:04.981706
 on the ninth of va 0:00:05.005422
april two  0:00:05.027216
 thousand aeight 0:00:05.049110
 at s 0:00:05.071463
seventeen o 0:00:05.095495
 for 0:00:05.119464
 all te 0:00:05.165063
xt as avaivlable 0:00:05.188987
 under the term 0:00:05.212004
ms of thei ganew 0:00:05.234286
 free document 0:00:05.256753
tation licens 0:00:05.278964
 sea 0:00:05.302575
 copyrites  0:00:05.325955
 for details 0:00:05.348070
 wichpedia 0:00:05.393193
 is aregister 0:00:05.416493
n trade mark  0:00:05.439150
 of the wikie mmede 0:00:05.462763
ea foundation 0:00:05.485545
 incorporated 0:00:05.509280
 a  0:00:05.532497
 eu as registrud 0:00:05.556309
d fival 0:00:05.579172
 one sea  0:00:05.601571
 three  0:00:05.623904
 tax theductable 0:00:05.647556
 non profhet c 0:00:05.671749
harity 0:00:05.694886
this sound fi 0:00:05.741046
le and all 0:00:05.763774
 text in the artic 0:00:05.786158
cle or li 0:00:05.809879
cense under  0:00:05.833817
 the thenew fre 0:00:05.855954
e documentation 0:00:05.878282
n license 0:00:05.901719
 availabl 0:00:05.924833
le at  0:00:05.947234
oubl you doubleyou  0:00:05.971071
 dw do 0:00:05.995365
t g 0:00:06.018369
 and you 0:00:06.042143
u dot 0:00:06.064688
 horg 0:00:06.088104
 slash  0:00:06.111014
 cope left 0:00:06.133253
 slash 0:00:06.156871
 f dee 0:00:06.180072
d el 0:00:06.202211
 dout each t 0:00:06.223746
ea m l 0:00:06.246271

Inference End: Avg CPU% 6.539338235294104

CpuRef runs on the calling core only, and I am not going to wait for this to end :slightly_smiling_face:

rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ python3 run_audio_file.py --audio_file_path samples/hp0.wav --model_file_path tflite_int8/wav2letter_int8.tflite --preferred_backends CpuRef
Your ArmNN library instance does not support Onnx models parser functionality.  Skipped IOnnxParser import.
Can't load libOpenCL.so: libOpenCL.so: cannot open shared object file: No such file or directory
Can't load libGLES_mali.so: libGLES_mali.so: cannot open shared object file: No such file or directory
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Preferred backends: ['CpuRef']
IDeviceSpec { supportedBackends: [CpuAcc, CpuRef, GpuAcc]}
Optimization warnings: ()
Processing Audio Frames...
 henry f 0:00:17.428529
 phillips 0:00:34.844938
 from wi 0:00:52.256472
epedeia 0:01:09.669870
 the free ind cy 0:01:27.090357
yclopedio 0:01:44.510629
 at e 0:02:01.929057
 nd d 0:02:19.346602
ot we ckuld 0:02:36.765552
pedia tha 0:02:54.186937
t org 0:03:11.611450
 henry 0:03:46.463853
 f philip 0:04:03.888195
s 0:04:21.314803
 from  0:04:38.740595
 wickepedia  0:04:56.169110
 the freie andcyclo 0:05:13.594531
opedia 0:05:31.026083
 henry fh 0:06:05.888977
e philips 0:06:23.323438
 eighteen ni 0:06:40.754814
nety to nin 0:06:58.187929
neteen fifty e 0:07:15.619196
ight 0:07:33.052838
 a yous  0:07:50.487450
 businessman 0:08:07.922139
 from portland 0:08:25.360068
d or again 0:08:42.795694
 a 0:09:00.233274
s the honor 0:09:17.666807
 of having the phi 0:09:35.106635
lip's head scre 0:09:52.540795
w and sc 0:10:09.976951
crewe driver  0:10:27.415814
 named after hi 0:10:44.852451
m 0:11:02.292572
 the importance  0:11:37.170950
 of the cross head  0:11:54.611701
 screw design 0:12:12.051614
 lies 0:12:29.492745
 in its t self ce 0:12:46.934857
entering property 0:13:04.377058
y  0:13:21.816384
useful on ad 0:13:39.257484
imnated production 0:13:56.696562
n lines t 0:14:14.136365
he use powered 0:14:31.577671
d screw drivers 0:14:49.021440
 philip 0:15:23.903284
's madjour contrib 0:15:41.344776
bution was i 0:15:58.786934
n driving the cro 0:16:16.227175
oss head concept 0:16:33.665869
 forward 0:16:51.105858
 to the point 0:17:08.546104
 where it was a 0:17:25.988356
dopted by scre 0:17:43.433439
ew makers an 0:18:00.875144
nd automabiele c 0:18:18.316596
companiese 0:18:35.762228
 although he recei 0:19:10.653780
ved patents for the 0:19:28.100305
 design in 0:19:45.545399
 nineteen thirty  0:20:02.988668
six 0:20:20.431421
 you as pate 0:20:37.879302
nt number  0:20:55.325021
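The elapsed-time stamps in these logs can be turned into a rough per-chunk figure. A small sketch, assuming every transcript line ends with an `h:mm:ss.ffffff` stamp as in the output above; the function names are mine and not part of the ArmNN example.

```python
import re
from datetime import timedelta

# Matches a trailing elapsed-time stamp such as "0:00:17.428529".
STAMP = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{6})\s*$")


def parse_elapsed(line):
    """Return the trailing stamp of a log line in seconds, or None."""
    match = STAMP.search(line)
    if match is None:
        return None
    hours, minutes, seconds, micros = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, microseconds=micros).total_seconds()


def mean_step(lines):
    """Average time between consecutive stamped lines, in seconds."""
    stamps = [t for t in map(parse_elapsed, lines) if t is not None]
    if len(stamps) < 2:
        return None
    return (stamps[-1] - stamps[0]) / (len(stamps) - 1)


if __name__ == "__main__":
    # First three lines of the CpuRef run, copied from the log above.
    cpu_ref = [
        " henry f 0:00:17.428529",
        " phillips 0:00:34.844938",
        " from wi 0:00:52.256472",
    ]
    print("CpuRef seconds per chunk:", mean_step(cpu_ref))
```

On those three CpuRef lines this works out to roughly 17.4 s per chunk, against roughly 0.022 s per chunk for the first three lines of the warm GpuAcc run.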

Going back to Jammy & Wayland
With https://github.com/armbian/build/releases/download/22.11.0-trunk.0096/Armbian_22.11.0-trunk.0096_Rock-5b_jammy_legacy_5.10.72.img.xz
Plus the current PPA, it still crashes, and I am wondering if there are any updates on that? It still seems to be holding some packages back, so maybe the DDR update is not in place.
@icecream95 If I build glmark2-es2-wayland myself and install it, it runs to completion and ends with a glmark2 Score of 866.
That is the normal like-for-like env I usually run glmark2 in, with no affinity or scheduler changes. It is less than I would hope for, but hey, it didn’t crash!
gg @icecream95 as things are looking nearer.

echo performance | sudo tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor


Bad bash script time

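# Print current and running-average GPU load once a second (Ctrl-C to stop)
TOTAL=0; COUNT=0
while :
do
        LOAD=$(cat /sys/devices/platform/fb000000.gpu/utilisation)
        TOTAL=$((TOTAL + LOAD)); COUNT=$((COUNT + 1))
        echo "Current Load: $LOAD"
        echo "Avg Load: $((TOTAL / COUNT))"
        sleep 1
done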

Load on avg 41%
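The same measurement can be taken for a fixed number of samples, which is handy for wrapping around a single benchmark run. A rough Python sketch, assuming the sysfs file holds a plain integer percentage (which is what the shell arithmetic above relies on); the function names are illustrative.

```python
import time

# Path taken from the script above; specific to this board and kernel.
GPU_UTIL_PATH = "/sys/devices/platform/fb000000.gpu/utilisation"


def read_gpu_load(path=GPU_UTIL_PATH):
    """Read the current GPU load, assumed to be an integer percentage."""
    with open(path) as handle:
        return int(handle.read().strip())


def average_load(samples, interval=1.0, read=read_gpu_load, sleep=time.sleep):
    """Average `samples` readings taken `interval` seconds apart.

    `read` and `sleep` are injectable so the logic can be tried
    without the sysfs file being present.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    total = 0
    for index in range(samples):
        total += read()
        if index + 1 < samples:
            sleep(interval)
    return total / samples
```

Calling `average_load(60)` while glmark2 runs would give a one-minute average comparable to the figure quoted here.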

What crashes? How? When you do what? Does anything appear in dmesg, the console, or journalctl?

Yup, that’s about what’s expected. It will take some work to change the driver so that work can be done in parallel between the CPU and GPU. With CSF the kernel doesn’t help with doing that as much as it did previously.

I didn’t check that much, as the repo glmark2-es2-wayland caused what seemed a catastrophic crash and locked up the Wayland client.
I just compiled glmark2-es2-wayland with waf and it runs fine, so apologies, I didn’t do any bug hunting. I have mostly used the Jammy Radxa image because of ArmNN, as I am not sure if 22.04 is part of the repo (18.04/20.04 are).

The RK3588 is a low-wattage powerhouse SBC for ML, with the CPU & GPU and the 3rd option of the NPU, which I haven’t tried yet as I am not too optimistic given quite a few dmesg entries about the NPU.
With the number of vendors and variations, the RK3588 & RK3588S could become an extremely common platform.
Have you reached out to Alyssa and Mesa like Asahi Lina did with the M1? A herd sharing and contributing to mainline is likely going to be quicker and stronger than many all soloing.
In a perfect world MFCC & spectrogram at 16 kHz would be part of the codecs and near load-free. Both TensorFlow & PyTorch have optimised C routines, so it is strange that Arm opted for Python; it cannot be stressed enough how much it sucks for any DSP :slight_smile:
Many thanks for your time and input.

Starting with Armbian_22.11.0-trunk.0096_Rock-5b_jammy_legacy_5.10.72.img
My sort of legacy thing I do as a check that it’s all there:
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install software-properties-common apt-utils git cmake build-essential autotools-dev autoconf libtool pkg-config
sudo reboot
Get the CSF firmware bin into /lib/firmware
sudo add-apt-repository ppa:liujianfeng1994/panfork-mesa
sudo apt install ubuntu-desktop

Log in to wayland
sudo apt-get install glmark2-es2-wayland

This time it runs with no problem, but before, on 2 different tries, I got a total catastrophic lockup and decided to compile with waf myself.
I guess it was not the waf build that made the difference, just a spurious crash.

I still get artefacts on the background image but have always ignored that.

So I dunno, as I tried it again expecting a crash and it worked fine: glmark2 Score 872

On Radxa’s Debian, I cannot see the mouse cursor on GNOME Wayland using @icecream95’s panfork, but the cursor is visible on KDE Wayland (and Weston), and I’m not seeing the graphical glitch that @stuartiannaylor is seeing on either desktop. I compiled panfork myself.

KDE Wayland has performance issues with the OpenGL-based compositor: the desktop hangs from time to time and is back to normal after a second or two, but XRender works just fine. Firefox also has this hanging issue, even with the XRender compositor. GNOME, however, does not have such performance issues at all; although I cannot see the cursor, Firefox is very smooth.

Weston works perfectly: performance is good and the cursor is visible. It’s just that Weston itself is not very user-friendly and I don’t want to use it every day.

X11 works fine on both KDE and GNOME. It is less performant than Wayland on my 4K60 displays and I can see frame drops, but at least it does not have serious bugs that affect normal usage.

One thing to note is that I’m using the mali_csffw.bin that comes with Radxa’s Debian, not the one in JeffyCN’s repo.

On mali_csffw.bin, I didn’t know there was a difference, as I am using the one from JeffyCN’s repo.

The Radxa version seems to be older than the one in JeffyCN’s repo, and I had some problems with JeffyCN’s in the early days of testing, so I used the Radxa one.

I don’t know, to be honest, as I am a bit confused by it all; it’s not really my thing.
I am not sure where the source is for either, or how they come into existence :slight_smile:
I am using Jammy server from Armbian with JeffyCN’s csffw and the icecream95 panfork PPA, plus a sudo apt install ubuntu-desktop.
The background glitch is weird as it only seems to happen on the background and nowhere else. I haven’t used it much, but everything else seems to work and new windows don’t have similar glitches.
So I have always just ignored it.

It was purely curiosity about the performance, as the Mali seems to be put under only approx 40% load by what I presume is the GLES driver, whereas playing with ML over OpenCL can average around 75% load on the Mali.
I just thought Wayland might be more performant.

OpenGL ES GPU pipelining
OpenGL ES exposes a synchronous rendering model to the application developer, despite
the underlying execution being asynchronous, whereas Vulkan exposes this
asynchronous nature directly to the application developer to manage. In either case it is
important that the application keeps the GPU fed with work, avoiding behaviors which
drain the pipeline and starve the GPU of work (unless the desired target frame rate has
been reached, of course).
Keeping the GPU busy not only means that you will get the best rendering performance
from the platform, but also that you avoid hitting performance oscillations caused by the
platform dynamic voltage and frequency scaling (DVFS) logic thinking that the CPU or
GPU is under-utilized.
Do
▪ Do not let the GPU go idle unless the target performance is reached.
▪ Pipeline any use of fences and query objects; don't wait on them too early.
▪ Use GL_MAP_UNSYNCHRONIZED to allow use of glMapBufferRange() to patch a safe
region of a buffer which is still partially referenced by in-flight draw calls.
▪ Pipeline any glReadPixels() calls to read asynchronously into a pixel buffer object.
Don't
▪ Use operations which enforce the synchronous behavior of OpenGL ES:
▪ glFinish()
▪ Synchronous glReadPixels()
▪ glMapBufferRange() without GL_MAP_UNSYNCHRONIZED on buffer still referenced
by a draw call
▪ Use glMapBufferRange() with
historical specification ambiguity these flags will currently trigger the creation of a
resource ghost.
▪ Use glFlush() because this may force render passes to be split; the driver will flush as
Impact
▪ Pipeline draining will, at a minimum, result in a loss of performance as the GPU will
be partially idle for the duration of the bubble.
▪ Possible performance instability, depending on the interaction with the platform's
DVFS power management logic.
Debugging
▪ System profilers such as DS-5 Streamline can show both CPU and GPU activity.
Pipeline drains of this nature are normally clearly visible as periods of busy time
oscillating between the CPU and GPU, with neither being fully utilized.
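To put a rough number on what a pipeline drain costs, here is a back-of-envelope sketch. The per-frame CPU and GPU times are made up for illustration, not measured on the RK3588, and the model is deliberately simple: the GPU is either fully serialised behind the CPU or fully overlapped with it.

```python
def gpu_busy_fraction(cpu_ms, gpu_ms, pipelined):
    """Steady-state fraction of each frame during which the GPU is busy.

    Serial: the CPU and GPU take turns, so a frame costs cpu_ms + gpu_ms.
    Pipelined: the CPU prepares frame N+1 while the GPU renders frame N,
    so a frame costs whichever of the two is slower.
    """
    frame_ms = max(cpu_ms, gpu_ms) if pipelined else cpu_ms + gpu_ms
    return gpu_ms / frame_ms


if __name__ == "__main__":
    # Hypothetical per-frame costs, chosen only to illustrate the effect.
    cpu_ms, gpu_ms = 6.0, 4.0
    print("serial   :", gpu_busy_fraction(cpu_ms, gpu_ms, pipelined=False))
    print("pipelined:", gpu_busy_fraction(cpu_ms, gpu_ms, pipelined=True))
```

With those illustrative numbers the GPU is busy 40% of the time when serialised and about 67% when overlapped, which is the general shape of the gap being discussed in this thread.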

I thought X11 was forcing all those “don’ts” of synchronous operation, and from what @icecream95 said that seems to be the problem: there is no parallelism between GPU and CPU, as one is waiting for the other?
I was purely trying to get a best rough bench, so will have to leave it to you 2. Apols :slight_smile:

In the DTB it seems to be declared as Bifrost, which doesn’t have a CSF, so yeah, I am totally confused about the state of play :slight_smile:

	gpu@fb000000 {
		compatible = "arm,mali-bifrost";
		reg = <0x00 0xfb000000 0x00 0x200000>;
		interrupts = <0x00 0x5e 0x04 0x00 0x5d 0x04 0x00 0x5c 0x04>;
		interrupt-names = "GPU\0MMU\0JOB";
		clocks = <0x0e 0x05 0x02 0x115 0x02 0x116 0x02 0x114>;
		clock-names = "clk_mali\0clk_gpu_coregroup\0clk_gpu_stacks\0clk_gpu";
		assigned-clocks = <0x0e 0x05>;
		assigned-clock-rates = <0xbebc200>;
		power-domains = <0x4e 0x0c>;
		operating-points-v2 = <0x4f>;
		#cooling-cells = <0x02>;
		dynamic-power-coefficient = <0xba6>;
		upthreshold = <0x1e>;
		downdifferential = <0x0a>;
		status = "okay";
		mali-supply = <0x50>;
		mem-supply = <0x50>;
		phandle = <0x4d>;
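The hex cells in a decompiled DTB are easier to read once converted. A small sketch that decodes a few of the values above; reading `reg` as a 64-bit address followed by a 64-bit size (two cells each) is my assumption about the parent node, and the helper name is made up.

```python
def parse_cells(prop):
    """Turn a dtc-style '<0x.. 0x..>' cell list into a list of ints."""
    return [int(token, 0) for token in prop.strip().strip("<>;").split()]


def join_cells(cells):
    """Combine 32-bit cells, most significant first, into one integer."""
    value = 0
    for cell in cells:
        value = (value << 32) | cell
    return value


if __name__ == "__main__":
    rate_hz = parse_cells("<0xbebc200>")[0]
    print("assigned clock rate:", rate_hz / 1e6, "MHz")

    # Assumes #address-cells = 2 and #size-cells = 2 in the parent node.
    reg = parse_cells("<0x00 0xfb000000 0x00 0x200000>")
    base, size = join_cells(reg[:2]), join_cells(reg[2:])
    print("register base:", hex(base), "size:", size // (1024 * 1024), "MiB")

    print("upthreshold:", parse_cells("<0x1e>")[0])
    print("downdifferential:", parse_cells("<0x0a>")[0])
```

The register base matches the `fb000000.gpu` name in the sysfs path used earlier, and the two threshold values decode to 30 and 10.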

I’ve been trying to debug that issue for a couple of days, but still don’t know how to fix it…

I don’t want to implement whatever hacks the blob is doing for Wayland, so improving the average GPU load will require some kernel changes. It’s still a while to get to that point, though.

The BSP kernel has three copies of Arm’s Mali kernel driver—one for Utgard (mali400), one for Midgard, and one for Bifrost (which also supports Valhall). Changes to the kernel driver for Valhall support went to the Bifrost source tree, so the GPU is marked compatible with the Bifrost kernel driver.

The Bifrost kernel and Valhall kernel downloads from Arm are identical.

From the kernel’s perspective, (pre-CSF, “Job Manager”) Valhall is more
or less compatible with Bifrost

Was it the 1st/2nd gen Valhall that didn’t have a CSF? That is what I was completely missing, and it has been a complete confusion to me, as it’s a pretty big change but still keeps the same architecture name. I guess it should be identified as mali-csf?

There seems to be less of a change between Bifrost & the 1st/2nd gen Valhall, which did get a name change, whereas 3rd gen Valhall comes with a substantially different CSF that really is not incremental in the way the other Valhall additions to Bifrost have been, yet keeps the same architecture name ?!?

So yeah, I am totally confused, and yeah, the Bifrost kernel and Valhall kernel probably are identical, which may be why the Mali G610 with no CSF coordination is only achieving 40% load.
Also we are on this BSP kernel of 5.10 that is totally out of whack with Mesa, so I guess you are also having to heavily backport.

The CSF is the man in the middle between the CPU IRQ and the GPU, with a queue that isn’t waiting for completion. I presume that should have a kernel implementation, but currently it is being hacked into userspace?

I am going to have to leave it to you @icecream95, but I am not really sure about soloing this, as surely the herd of Mesa and Alyssa are needed to pool resources?

@stuartiannaylor @icecream95
Just curious, does HW video decoding work in any browser (e.g. FF or Chromium) with that Armbian+Mali driver combo under Wayland? I’m interested to give it a try soon.

The only one that would probably work is the custom Chromium provided by Rockchip. The one shipped with Radxa’s Debian does not support Wayland, but Rockchip said that the Wayland patch does exist and some customers are using it, probably Khadas?

I am not sure, as I was purely interested in whether the Wayland implementation helps with the current problems of balancing the CPU/GPU IRQ calls in the new command stream. Currently the Mali G610 seems very underloaded, as the parallelism doesn’t seem to exist at all, so one is always waiting for the other and the process seems serial.

It’s @icecream95 who knows a huge amount, being a prolific contributor who did much with OpenCL, and so is in a totally different league to me.

Put simplistically, in terms my limited brain can handle, the current drivers are a bit like the early Mali fbdev drivers we had: they did specific jobs but likely have loads of gotchas, as they may not be plumbed into VA-API, which is how you get Chromium acceleration, which I didn’t try. When browsing it’s really video acceleration that matters rather than 3D.

Because the Mali G610 is such a strong GPU, for many the results are currently quite good, but really, when looking at load, we have a pretty terrible implementation.
I am sort of confused, as the CSF & third gen Valhall would seem a more significant change from 1st & 2nd gen Valhall than those were from the previous Bifrost they got a name change from.

Going by cat /sys/devices/platform/fb000000.gpu/utilisation, to be honest, until we get an optimised driver under the hood I haven’t even bothered testing Chromium.
The Mali is just a 3D GPU, and for video we also have hardware encoders to plumb in. Again I am confused there, as I thought a lot of work had been done on the state management of the video encoder/decoder and that the Hantro RKDec shared a lot of similarities, but apparently not.

I think there might be considerable work needed on the Mali G610 driver and it could take some time. When using it with OpenCL, load gets to 70%, which might be near the max given the ArmNN & OpenCL layers used in the above wav2letter example, but even that might get more optimised.

I am not worthy of, or anywhere near, the level of icecream95; I was just curious about the state of play and got loads of really great feedback. I am hoping some of the dmesg errors of the NPU get fixed so I can play with the NPU, as I am not expecting anything imminent GPU-wise.

I am also thinking this might be the fastest mainline adoption we will see, due to the large number of RK3588/RK3588S vendors and variations out there, as really I would prefer to ditch the Rockchip BSP.