In general I’d rather look than speculate, and that’s what atop and /proc/interrupts are for.
IRQ affinity is broken with RPi kernels, so cpu0 might be the bottleneck. Combine that with the ondemand governor w/o io_is_busy=1, and cpu0 might not even clock at 2400 MHz while serving all interrupts. Writes may also generate twice as many IRQs as reads. But that’s speculation, and it’s really easy to check with the aforementioned methods.
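
To make the "look, don't speculate" part concrete, here is a minimal Python sketch that sums the per-CPU counters from /proc/interrupts and reports how much of the load lands on cpu0. The function names (per_cpu_irq_totals, cpu0_share) are made up for illustration, and the parser assumes the usual layout: a header row of CPU names followed by one row per IRQ.

```python
from pathlib import Path


def per_cpu_irq_totals(text):
    """Sum the per-CPU interrupt counters in /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        _, _, rest = line.partition(":")
        fields = rest.split()[:len(cpus)]
        # Rows such as "Err:" carry a single counter instead of one per CPU.
        if len(fields) < len(cpus) or not all(f.isdigit() for f in fields):
            continue
        for cpu, count in zip(cpus, fields):
            totals[cpu] += int(count)
    return totals


def cpu0_share(totals):
    """Fraction of all counted interrupts that were served by CPU0."""
    total = sum(totals.values())
    return totals.get("CPU0", 0) / total if total else 0.0


if __name__ == "__main__" and Path("/proc/interrupts").exists():
    totals = per_cpu_irq_totals(Path("/proc/interrupts").read_text())
    for cpu, count in totals.items():
        print(f"{cpu}: {count}")
    print(f"CPU0 share: {cpu0_share(totals):.1%}")
```

The counters are cumulative since boot, so for a benchmark it is more telling to take one snapshot before and one after the run and subtract them. Doing that once for a read test and once for a write test would also show whether writes really generate more IRQs.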
Also, switching to the performance governor might result in better throughput numbers, since this masks part of the problem (clockspeeds not ramping up quickly enough).
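
Whether cpu0 actually ramps up can be checked the same way. The sketch below reads the active governor and the current and maximum clock from the standard cpufreq sysfs files (scaling_governor, scaling_cur_freq, scaling_max_freq). The helper name read_cpufreq and its root parameter are hypothetical, and it assumes the kernel exposes cpufreq for that core.

```python
from pathlib import Path


def read_cpufreq(cpu=0, root="/sys/devices/system/cpu"):
    """Return governor plus current and maximum clock (MHz) for one core."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    governor = (base / "scaling_governor").read_text().strip()
    # The kernel reports these frequencies in kHz.
    cur_khz = int((base / "scaling_cur_freq").read_text())
    max_khz = int((base / "scaling_max_freq").read_text())
    return {
        "governor": governor,
        "cur_mhz": cur_khz / 1000,
        "max_mhz": max_khz / 1000,
    }


if __name__ == "__main__":
    if Path("/sys/devices/system/cpu/cpu0/cpufreq").exists():
        info = read_cpufreq(0)
        print(f"cpu0: {info['governor']}, "
              f"{info['cur_mhz']:.0f} of {info['max_mhz']:.0f} MHz")
```

Calling this in a loop while the benchmark runs shows whether cpu0 stays below its maximum under ondemand. Repeating the run after writing performance into scaling_governor as root then tells you how much of the throughput difference comes down to clockspeeds.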