M.2 SSD will randomly fail to write on ROCK 3A

jaysonsantos · April 1, 2022, 1:51pm

Hey there, I have been running a ROCK 3A for a while with a PoE HAT and an NVMe SSD attached. It randomly starts failing to write to the disk, and then the kernel remounts it read-only.
You can see the dmesg log in this gist: https://gist.github.com/jaysonsantos/dfb76e9401ebb932bbfdb90734a4ebc8
The kernel commit I am running is radxa/kernel@12d0b2b ("arm64: dts: radxa cm3 io: add HD101BOE9365 Display support"),
with this as the only change to the config:

diff --git a/arch/arm64/configs/rockchip_linux_defconfig b/arch/arm64/configs/rockchip_linux_defconfig
index bcfd115935e4..5af59cabd46e 100644
--- a/arch/arm64/configs/rockchip_linux_defconfig
+++ b/arch/arm64/configs/rockchip_linux_defconfig
@@ -1166,3 +1166,8 @@ CONFIG_FUNCTION_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_LKDTM=y
 CONFIG_CGROUP_NET_PRIO=y
+
+CONFIG_SCSI_ISCSI_ATTRS=y
+CONFIG_ISCSI_TCP=y
+
+CONFIG_ARM64_VA_BITS_48=y

Any idea what could be causing it? Could it be power? The PoE switch provides up to 30 W per port.

setq · April 2, 2022, 3:23am

Hi, please upload the dmesg.

jaysonsantos · April 2, 2022, 9:31pm

Hi, the dmesg is at the link below, since there is a body-size limit here and .txt files cannot be uploaded.

https://gist.github.com/jaysonsantos/dfb76e9401ebb932bbfdb90734a4ebc8

gistfile0.txt

-- Logs begin at Thu 2022-03-31 01:08:20 UTC, end at Fri 2022-04-01 13:43:38 UTC. --
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Booting Linux on physical CPU 0x0000000000 [0x412fd050]
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Linux version 4.19.193-1001-rockchip-g12d0b2b258a3 (jayson@hp-proliant-dl160-g6-1) (gcc version 10.3.1 20210621 (GNU Toolchain for the A-profile Architecture 10.3-2021.07 (arm-10.29)), GNU ld (GNU Toolchain for the A-profile Architecture 10.3-2021.07 (arm-10.29)) 2.36.1.20210621) #rockchip SMP Thu Mar 31 13:27:16 UTC 2022
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Machine model: Radxa ROCK 3 Model A
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: fdt: Reserved memory: failed to reserve memory for node 'drm-logo@00000000': base 0x0000000000000000, size 0 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: fdt: Reserved memory: failed to reserve memory for node 'drm-cubic-lut@00000000': base 0x0000000000000000, size 0 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Reserved memory: created CMA memory pool at 0x00000001e0000000, size 512 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: reserved mem: initialized node rknpu, compatible id shared-dma-pool
Apr 01 09:43:04 k3s-rock-3a-1 kernel: cma: Reserved 16 MiB at 0x00000000ef000000
Apr 01 09:43:04 k3s-rock-3a-1 kernel: On node 0 totalpages: 2031104


jack · April 5, 2022, 11:48am

From the log, the NVMe SSD is down. What power adapter are you using, and have you connected any other USB devices?

jaysonsantos · April 5, 2022, 1:43pm

Hey @jack, I am using a PoE switch that delivers up to 30 W per port, with no other USB devices connected, just the NVMe SSD and the Ethernet cable.

jack · April 6, 2022, 11:10am

What PoE HAT are you using? I suspect the 5 V supply is dropping, so the NVMe is running under-voltage.

jaysonsantos · April 6, 2022, 11:36am

Hey @jack, it is a RockPi_PoE_F4L Rock Pi 4. If you can point me to where I should measure the voltage, I can double-check it.

aghost · April 8, 2022, 7:55am

@jack
I have the same problem.
It has happened on all 4 of my SBCs.
The power supply is a CoolMaster GX550.

aghost · April 8, 2022, 8:00am

I’m running the following from cron to alleviate the problem:

1. Check that the NVMe filesystem is mounted and is rw.
2. Check that the NVMe partition exists under /dev/; if not, reboot.
3. Stop all services that use the NVMe mount point.
4. Unmount all NVMe partitions.
5. Run fsck.
6. Remount all NVMe partitions.
7. Restart all the services that were stopped earlier.
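
For readers who want to try the same idea, the first two checks in the list above could be sketched roughly like this. It is not aghost's actual script: the function names `mount_state` and `next_action` are made up for illustration, and the sketch only decides which step applies. It does not stop services, run fsck, or reboot anything.

```shell
#!/bin/sh
# Sketch only: decide what the cron job should do next.
# mount_state and next_action are hypothetical helper names.

# mount_state MOUNT_POINT [MOUNTS_FILE]
# Prints "rw", "ro", or "missing". MOUNTS_FILE defaults to /proc/mounts,
# whose fields are: device, mount point, fs type, options, dump, pass.
mount_state() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 is the comma-separated option list (rw or ro comes first).
            n = split($4, opts, ",")
            state = "ro"
            for (i = 1; i <= n; i++) if (opts[i] == "rw") state = "rw"
        }
        END { print (state == "" ? "missing" : state) }
    ' "${2:-/proc/mounts}"
}

# next_action DEVICE MOUNT_POINT [MOUNTS_FILE]
# Prints "reboot" if the device node is gone, "ok" if the mount is rw,
# and "repair" otherwise (stop services, umount, fsck, remount, restart).
next_action() {
    dev=$1
    mp=$2
    mounts=${3:-/proc/mounts}
    if [ ! -e "$dev" ]; then
        echo reboot
        return
    fi
    case "$(mount_state "$mp" "$mounts")" in
        rw) echo ok ;;
        *)  echo repair ;;
    esac
}
```

A cron wrapper might call something like `next_action /dev/nvme0n1p1 /mnt/nvme` and act on the printed word. Both the partition name and the mount point here are placeholders, not values from this thread.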

jaysonsantos · May 2, 2022, 7:16pm

Thanks for this, but it is still not a solution in the long run.
I wonder if the same thing would happen with eMMC.

jaysonsantos · May 2, 2022, 7:21pm

Hey jack, the same thing happens even when I plug in a QC charger.
To reproduce it faster, you can run: apt install stress && cd /nvme-path && { journalctl -kf & stress --cpu 8 --vm 8 --hdd 8; }
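
While the stress test runs, it can help to filter the kernel log down to the lines that matter. The following is a rough sketch, not something from the thread: `storage_errors` is a hypothetical helper, and the patterns are assumptions based on common ext4, block-layer, and NVMe driver messages, since the full dmesg from the gist is not reproduced here.

```shell
#!/bin/sh
# Sketch only: read kernel log lines on stdin and print the ones that
# look like an NVMe or filesystem failure. storage_errors is a
# hypothetical name; the patterns are assumptions, not taken from the gist.
storage_errors() {
    grep -E \
        -e 'Remounting filesystem read-only' \
        -e 'I/O error, dev nvme' \
        -e 'nvme[0-9]+: .*timeout' \
        -e 'controller is down'
}
```

For example, `journalctl -k --no-pager | storage_errors` would scan the kernel messages collected so far. The function exits non-zero when nothing matches, which makes it easy to use in a script.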

jaysonsantos · June 6, 2022, 8:25pm

Hey @jack, do you have the spec for the maximum current allowed for the M.2 SSD anywhere? That is the only thing I can think of. I have two different SSDs at home, each with different power requirements.

jack · June 7, 2022, 2:30am

We will try to reproduce this issue. The design supply current for the SSD on the ROCK 3A is 5 A. Check your SSD’s power consumption; usually it is 3.3 V, 3 A peak.

jaysonsantos · June 7, 2022, 6:51am

Hey @jack, thanks for the answer!
The one I have at home is a Crucial rated at 3.3 V, 2.5 A, so I should still have plenty of headroom.
I will try one of the kernel 5.x builds to check whether their stack has better luck.
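
As a quick sanity check on the figures quoted above, power is voltage times current. The last line assumes the 5 A design figure refers to the 3.3 V M.2 supply, which the thread does not state explicitly. All three are rated values, not measurements.

```latex
\begin{align*}
P_{\text{Crucial}} &= 3.3\,\mathrm{V} \times 2.5\,\mathrm{A} = 8.25\,\mathrm{W} \\
P_{\text{typical peak}} &= 3.3\,\mathrm{V} \times 3\,\mathrm{A} = 9.9\,\mathrm{W} \\
P_{\text{design}} &= 3.3\,\mathrm{V} \times 5\,\mathrm{A} = 16.5\,\mathrm{W}
\end{align*}
```

On paper, then, the SSD's rated draw sits below both the usual peak figure and the board's design limit.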

jaysonsantos · June 7, 2022, 6:29pm

Hey jack, good news, using armbian (Armbian 22.05.1 Bullseye) with the following kernel:
Linux rock-3a 5.18.0-rk35xx #22.05.1 SMP PREEMPT Sat May 28 08:41:15 UTC 2022 aarch64 GNU/Linux
I’ve managed to run a stress test like the one I shared above for 30 minutes without the NVMe being remounted read-only.
SMART complained, but so far it is way better:
Device: /dev/nvme0, number of Error Log entries increased from 476187 to 476188

jaysonsantos · June 7, 2022, 6:31pm

Scratch that, the error came back after a while

jack · June 13, 2022, 8:34am

Hi, @jaysonsantos

We have had a ROCK 3A with a 23 W PoE HAT running for some days, and we can now reproduce this issue. It needs more investigation. I will post an update if we have new findings.

jaysonsantos · June 13, 2022, 8:54am

Hey @jack, thanks for the answer!
I noticed that on Armbian with kernel 5.18 it still happens, but less frequently.
Would it be beneficial to backport this patch [1] into Armbian?
[1] https://lore.kernel.org/linux-arm-kernel/165459351568.925770.13686160465924068647.b4-ty@sntech.de/T/

amazingfate · June 13, 2022, 9:37am

I noticed that the Armbian mainline kernel doesn’t have this patch, but my recent work on 5.19-rc1 includes it: https://github.com/amazingfate/build/blob/rockchip64-5.19/patch/kernel/archive/rockchip64-5.19/rk356x-dts-pcie2x1.patch#L13
You can try my 5.19-rc1 kernel: https://drive.google.com/drive/folders/1y4fYI87xFvOrChLhlyevPcZxeP8gHBQM?usp=sharing

aghost · June 17, 2022, 2:08am

Did you locate the problem?