Hey there, I have been running a ROCK 3A for a while with a PoE HAT and an NVMe drive attached. It randomly starts failing to write to the disk, and then the kernel remounts it read-only.
You can see the dmesg log here: gist:dfb76e9401ebb932bbfdb90734a4ebc8 (github.com)
The kernel commit I am running is: arm64: dts: radxa cm3 io: add HD101BOE9365 Display support · radxa/kernel@12d0b2b (github.com)
with only this change to the config:
diff --git a/arch/arm64/configs/rockchip_linux_defconfig b/arch/arm64/configs/rockchip_linux_defconfig
index bcfd115935e4..5af59cabd46e 100644
--- a/arch/arm64/configs/rockchip_linux_defconfig
+++ b/arch/arm64/configs/rockchip_linux_defconfig
@@ -1166,3 +1166,8 @@ CONFIG_FUNCTION_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_LKDTM=y
 CONFIG_CGROUP_NET_PRIO=y
+
+CONFIG_SCSI_ISCSI_ATTRS=y
+CONFIG_ISCSI_TCP=y
+
+CONFIG_ARM64_VA_BITS_48=y
Any idea what could be causing it? Could it be power? The PoE switch provides up to 30W per port.
setq
April 2, 2022, 3:23am
Hi, please upload the dmesg.
Hi, the dmesg is at the link, since there is a post-length limit here and .txt files cannot be uploaded.
gistfile0.txt
-- Logs begin at Thu 2022-03-31 01:08:20 UTC, end at Fri 2022-04-01 13:43:38 UTC. --
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Booting Linux on physical CPU 0x0000000000 [0x412fd050]
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Linux version 4.19.193-1001-rockchip-g12d0b2b258a3 (jayson@hp-proliant-dl160-g6-1) (gcc version 10.3.1 20210621 (GNU Toolchain for the A-profile Architecture 10.3-2021.07 (arm-10.29)), GNU ld (GNU Toolchain for the A-profile Architecture 10.3-2021.07 (arm-10.29)) 2.36.1.20210621) #rockchip SMP Thu Mar 31 13:27:16 UTC 2022
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Machine model: Radxa ROCK 3 Model A
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: fdt: Reserved memory: failed to reserve memory for node 'drm-logo@00000000': base 0x0000000000000000, size 0 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: fdt: Reserved memory: failed to reserve memory for node 'drm-cubic-lut@00000000': base 0x0000000000000000, size 0 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: Reserved memory: created CMA memory pool at 0x00000001e0000000, size 512 MiB
Apr 01 09:43:04 k3s-rock-3a-1 kernel: OF: reserved mem: initialized node rknpu, compatible id shared-dma-pool
Apr 01 09:43:04 k3s-rock-3a-1 kernel: cma: Reserved 16 MiB at 0x00000000ef000000
Apr 01 09:43:04 k3s-rock-3a-1 kernel: On node 0 totalpages: 2031104
(This file has been truncated.)
jack
April 5, 2022, 11:48am
From the log, the NVMe SSD is down. What power adapter are you using, and do you have any other USB devices connected?
Hey @jack, I am using PoE that delivers up to 30W per port, and no other USB devices are connected, just the NVMe and the Ethernet cable.
jack
April 6, 2022, 11:10am
What PoE HAT are you using? I suspect the 5V rail is dropping, leaving the NVMe undervolted.
Hey @jack, it is a RockPi_PoE_F4L (Rock Pi 4). If you point me to where I should measure the voltage, I can double-check it.
aghost
April 8, 2022, 7:55am
@jack
I have the same problem.
It has happened on 4 of my SBCs.
The power supply is a CoolMaster GX550.
aghost
April 8, 2022, 8:00am
I'm using the following cron job to alleviate this problem:
1. Check whether the NVMe filesystem is mounted and read-write.
2. Check that the NVMe partition still exists under /dev/; if not, reboot.
3. Stop all services that use the NVMe mount point.
4. Unmount all NVMe partitions.
5. Run fsck.
6. Remount all NVMe partitions.
7. Restart all the services that were stopped before.
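A minimal Python sketch of the cron steps above, assuming an ext4 partition at /dev/nvme0n1p1 mounted on /nvme and a single k3s service using it (all three are hypothetical placeholders, not aghost's actual setup). It reads /proc/mounts, decides between doing nothing, rebooting and repairing, and only prints the commands unless run with --apply:

```python
import os
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical layout: adjust the device, mount point and services to your setup.
NVME_DEVICE = "/dev/nvme0n1p1"
MOUNT_POINT = "/nvme"
SERVICES = ["k3s"]


@dataclass
class Mount:
    device: str
    mountpoint: str
    options: List[str]


def parse_mounts(text: str) -> List[Mount]:
    """Parse /proc/mounts lines: device, mount point, fstype, options, ..."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            mounts.append(Mount(fields[0], fields[1], fields[3].split(",")))
    return mounts


def decide(mounts: List[Mount], device: str, mount_point: str,
           exists: Callable[[str], bool] = os.path.exists) -> str:
    """Map steps 1-2 of the cron job to an action: 'ok', 'reboot' or 'repair'."""
    for m in mounts:
        if m.device == device and m.mountpoint == mount_point and "rw" in m.options:
            return "ok"      # step 1: mounted read-write, nothing to do
    if not exists(device):
        return "reboot"      # step 2: the partition vanished from /dev
    return "repair"          # steps 3-7


def run(cmd: List[str], apply: bool) -> int:
    """Print the command; execute it only when apply is True."""
    print("+", " ".join(cmd), flush=True)
    return subprocess.run(cmd).returncode if apply else 0


def repair(device: str, mount_point: str, services: List[str], apply: bool) -> None:
    for svc in services:
        run(["systemctl", "stop", svc], apply)
    if run(["umount", mount_point], apply) != 0:
        # Never fsck a filesystem that is still mounted.
        print("umount failed; skipping fsck")
    else:
        run(["fsck", "-y", device], apply)
        run(["mount", device, mount_point], apply)
    for svc in services:
        run(["systemctl", "start", svc], apply)


def main(apply: bool) -> None:
    with open("/proc/mounts") as f:
        action = decide(parse_mounts(f.read()), NVME_DEVICE, MOUNT_POINT)
    print("action:", action)
    if action == "reboot":
        run(["systemctl", "reboot"], apply)
    elif action == "repair":
        repair(NVME_DEVICE, MOUNT_POINT, SERVICES, apply)


if __name__ == "__main__" and len(sys.argv) == 2:
    # --dry-run only prints the commands; --apply actually runs them.
    main(apply=(sys.argv[1] == "--apply"))
```

A hypothetical /etc/cron.d entry would be `*/5 * * * * root /usr/bin/python3 /usr/local/bin/nvme_guard.py --apply`. Try it with --dry-run first to see which commands it would issue.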
Thanks for this, but it is still not a long-term solution.
I wonder whether the same thing would happen with eMMC.
Hey jack, the same thing happens even when I plug in a QC charger.
To reproduce it faster, you can run: apt install stress && cd /nvme-path && { journalctl -kf & stress --cpu 8 --vm 8 --hdd 8; }
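To spot the failure while the stress test runs, here is a small Python filter for the journalctl -kf output. The three regular expressions are assumptions about which kernel messages accompany an NVMe dropout and a read-only remount (the nvme driver's "controller is down" warning, ext4's "Remounting filesystem read-only" and block-layer I/O errors); check them against your own dmesg before relying on them:

```python
import re
import sys
from typing import Iterable, Iterator

# Assumed patterns; compare them with the messages in your own kernel log.
PATTERNS = [
    re.compile(r"nvme\S*: controller is down"),                # nvme driver resetting a dead controller
    re.compile(r"EXT4-fs .*Remounting filesystem read-only"),  # ext4 giving up after write errors
    re.compile(r"I/O error, dev nvme"),                        # block-layer I/O errors on the NVMe
]


def failures(lines: Iterable[str]) -> Iterator[str]:
    """Yield every log line that matches one of the failure patterns."""
    for line in lines:
        if any(p.search(line) for p in PATTERNS):
            yield line.rstrip("\n")


if __name__ == "__main__" and "--follow" in sys.argv[1:]:
    # Read the kernel log from stdin and report matching lines as they arrive.
    for hit in failures(sys.stdin):
        print("NVMe failure:", hit, flush=True)
```

Usage (watch_nvme.py is a hypothetical file name): journalctl -kf | python3 watch_nvme.py --follow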
Hey @jack, do you have the specs for the maximum current allowed for the M.2 SSD anywhere? That is the only thing I can think of. I have two different SSD models at home, each with different power requirements.
jack
June 7, 2022, 2:30am
We will try to reproduce this issue. The design current for the SSD on the ROCK 3A is 5A. Check your SSD's power consumption; it is usually 3.3V, 3A peak.
Hey @jack, thanks for the answer!
The one I have at home is a Crucial rated at 3.3V 2.5A, so I should still have plenty of power left.
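The power figures in this exchange can be checked with P = V × I. A small Python sketch, assuming (this is not stated explicitly in the thread) that Radxa's 5A design figure and both SSD ratings refer to the M.2 slot's 3.3V rail:

```python
# Figures quoted in this thread. Assumption: the 5A slot budget and the SSD
# ratings all refer to the M.2 slot's 3.3V supply rail.
RAIL_V = 3.3
SLOT_BUDGET_A = 5.0


def ssd_headroom(peak_a: float, budget_a: float = SLOT_BUDGET_A, rail_v: float = RAIL_V):
    """Return (SSD peak watts, slot budget watts, spare amps) on the 3.3V rail."""
    return rail_v * peak_a, rail_v * budget_a, budget_a - peak_a


for name, peak in [("typical SSD (3A peak)", 3.0), ("Crucial (2.5A)", 2.5)]:
    ssd_w, budget_w, spare_a = ssd_headroom(peak)
    print(f"{name}: {ssd_w:.2f} W of {budget_w:.2f} W budget, {spare_a:.1f} A spare")
```

This prints 9.90 W and 8.25 W against a 16.50 W slot budget. On paper that leaves headroom, but nominal ratings say nothing about transient dips, so the 5V undervoltage suspected earlier is not ruled out by this arithmetic alone.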
I will try one of the kernel 5.x builds to check whether their stack fares better.
Hey jack, good news: using Armbian (Armbian 22.05.1 Bullseye) with the following kernel:
Linux rock-3a 5.18.0-rk35xx #22.05.1 SMP PREEMPT Sat May 28 08:41:15 UTC 2022 aarch64 GNU/Linux
I've managed to run a stress test like the one I shared above for 30 minutes without the NVMe being remounted.
SMART complained, but so far it is much better:
Device: /dev/nvme0, number of Error Log entries increased from 476187 to 476188
Scratch that, the error came back after a while.
jack
June 13, 2022, 8:34am
Hi, @jaysonsantos
We have had a 3A with a 23W PoE HAT running for some days, and we can now reproduce this issue. It needs more investigation; I will post an update if we find anything new.
Hey @jack, thanks for the answer!
I noticed that on Armbian with kernel 5.18 it still happens, but less frequently.
Could it be beneficial to backport this patch [1] into Armbian?
[1] https://lore.kernel.org/linux-arm-kernel/[email protected] /T/
aghost
June 17, 2022, 2:08am
Did you locate the problem?