M.2 SSD randomly fails to write on ROCK 3A

Hey there, I have been running a ROCK 3A for a while with a PoE HAT and an NVMe SSD attached, and it randomly starts failing to write to the disk, after which the kernel remounts it read-only.
You can see the dmesg log here: gist:dfb76e9401ebb932bbfdb90734a4ebc8 (github.com)
The kernel commit I am running is: arm64: dts: radxa cm3 io: add HD101BOE9365 Display support · radxa/kernel@12d0b2b (github.com)
with only this change in the config:

diff --git a/arch/arm64/configs/rockchip_linux_defconfig b/arch/arm64/configs/rockchip_linux_defconfig
index bcfd115935e4..5af59cabd46e 100644
--- a/arch/arm64/configs/rockchip_linux_defconfig
+++ b/arch/arm64/configs/rockchip_linux_defconfig
@@ -1166,3 +1166,8 @@ CONFIG_FUNCTION_TRACER=y

Any idea what could be causing it? Could it be power? The PoE switch provides up to 30W per port.

Hi, please upload the dmesg.

Hi, the dmesg is in the link, as there is a post body size limit here and .txt files cannot be uploaded.

From the log, the NVMe SSD goes down. What power adapter are you using, and did you connect any other USB devices?

Hey @jack, I am using a PoE switch that delivers up to 30W per port, and there are no other USB devices connected, just the NVMe and the Ethernet cable.

What PoE HAT are you using? I suspect the 5V rail is dropping, so the NVMe is running under voltage.

Hey @jack, it is a RockPi_PoE_F4L Rock Pi 4. If you point me to where I should take the voltage measurement, I can double-check it.

I have the same problem.
It has happened on all 4 of my SBCs.
The power supply is a CoolMaster GX550.

I'm using this cron job to alleviate the problem:

  1. check if the nvme filesystem is mounted and rw
  2. check that /dev/ has the nvme partition node; if not, reboot
  3. stop all services using the nvme mount point
  4. umount all nvme partitions
  5. run fsck
  6. remount all nvme partitions
  7. restart all services that were stopped before
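The steps above can be sketched as a script. This is only a sketch: the device node, mount point, and service list here are assumptions, so adjust them for your setup.

```shell
#!/bin/sh
# Sketch of the recovery steps above. The device node, mount point,
# and service list are ASSUMPTIONS -- adjust them for your setup.
NVME_DEV="/dev/nvme0n1p1"   # assumed data partition
NVME_MNT="/mnt/nvme"        # assumed mount point
SERVICES="docker"           # assumed services using the mount

# step 1: is the filesystem mounted read-write?
mount_is_rw() {
  findmnt -no OPTIONS "$NVME_MNT" 2>/dev/null | tr ',' '\n' | grep -qx rw
}

recover() {
  # step 2: if the partition node itself is gone, only a reboot helps
  [ -b "$NVME_DEV" ] || { reboot; return; }
  # step 3: stop everything holding the mount open
  for s in $SERVICES; do systemctl stop "$s"; done
  # steps 4-6: unmount, repair, remount
  umount "$NVME_MNT"
  fsck -y "$NVME_DEV"
  mount "$NVME_DEV" "$NVME_MNT"
  # step 7: bring the services back
  for s in $SERVICES; do systemctl start "$s"; done
}

# only act when invoked with "run" and the fs is missing or read-only
if [ "${1:-}" = "run" ] && ! mount_is_rw; then
  recover
fi
```

It could then run from cron every minute, e.g. `* * * * * /usr/local/sbin/nvme-recover.sh run`.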

Thanks for this, but it is still not a long-term solution.
I wonder if the same thing would happen with eMMC.

Hey jack, even if I plug a QC charger into it, the same thing happens.
To reproduce it faster you can run: apt install stress; cd /nvme-path; journalctl -kf & stress --cpu 8 --vm 8 --hdd 8

Hey @jack, do you have the specs for the maximum current allowed for an M.2 SSD anywhere? That is the only thing I can think of; I have different drives at home, each with different power requirements.

We will try to reproduce this issue. The designed power current for the SSD on ROCK 3A is 5A. Check your SSD's power consumption; usually it is 3.3V 3A peak.
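One way to compare your drive's rated draw against that 5A budget is to read back the power states the controller itself reports. A minimal sketch, assuming nvme-cli is installed and the device is /dev/nvme0; the sample line in the comment mirrors nvme-cli's output format:

```shell
# Extract the highest rated power draw ("mp" = maximum power) from
# `nvme id-ctrl` output. Power-state lines from nvme-cli look like:
#   ps    0 : mp:9.00W operational enlat:0 exlat:0 ...
peak_power() {
  grep -o 'mp:[0-9.]*W' | sort -t: -k2 -rn | head -n1
}

# usage on real hardware (the device name is an assumption):
#   nvme id-ctrl /dev/nvme0 | peak_power
```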

Hey @jack, thanks for the answer!
The one I have at home is a Crucial rated at 3.3V 2.5A, so I should still have plenty of power headroom.
I will try one of the kernel 5.x builds to check if their stack has better luck.

Hey jack, good news: using Armbian (Armbian 22.05.1 Bullseye) with the following kernel:
Linux rock-3a 5.18.0-rk35xx #22.05.1 SMP PREEMPT Sat May 28 08:41:15 UTC 2022 aarch64 GNU/Linux
I've managed to run a stress test like the one I shared above for 30 minutes without the NVMe being remounted.
SMART complained, but so far it is way better:
Device: /dev/nvme0, number of Error Log entries increased from 476187 to 476188
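That counter smartd is warning about can also be polled directly from smartctl to track whether it keeps climbing during a stress run. A sketch assuming smartmontools is installed and the device is /dev/nvme0:

```shell
# Pull the "Error Information Log Entries" counter out of `smartctl -a`
# output; smartctl prints it with thousands separators, e.g.:
#   Error Information Log Entries:      476,188
error_count() {
  grep -i 'Error Information Log Entries' | grep -o '[0-9,]*$' | tr -d ','
}

# usage on real hardware (the device name is an assumption):
#   smartctl -a /dev/nvme0 | error_count
```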

Scratch that, the error came back after a while.

Hi, @jaysonsantos

We have a ROCK 3A with a 23W PoE HAT setup running for some days, and we can reproduce this issue now. We need to investigate it further. I will update if we have new findings.

Hey @jack, thanks for the answer!
I noticed that on Armbian with kernel 5.18 it does still happen, but less frequently.
Could this [1] patch be beneficial to backport into Armbian?
[1] https://lore.kernel.org/linux-arm-kernel/165459351568.925770.13686160465924068647.b4-ty@sntech.de/T/


I notice that the Armbian mainline kernel doesn't have this patch, but my recent work on 5.19-rc1 includes it: https://github.com/amazingfate/build/blob/rockchip64-5.19/patch/kernel/archive/rockchip64-5.19/rk356x-dts-pcie2x1.patch#L13
You can try my 5.19-rc1 kernel: https://drive.google.com/drive/folders/1y4fYI87xFvOrChLhlyevPcZxeP8gHBQM?usp=sharing

Did you locate the problem?