M.2 SSD will randomly fail to write on ROCK 3A

@jack
I have the same problem.
It has happened on 4 of my SBCs.
The power supply is a CoolMaster GX550.

I'm using the following procedure from cron to alleviate the problem.

  1. Check whether the NVMe filesystem is mounted and read-write.
  2. Check that the NVMe partition device node exists under /dev/; if not, reboot.
  3. Stop all services that use the NVMe mount point.
  4. Unmount all NVMe partitions.
  5. Run fsck.
  6. Remount all NVMe partitions.
  7. Restart all services that were stopped earlier.
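
The steps above could be sketched as a small script for cron, roughly as follows. This is only a sketch under assumptions: the mount point /mnt/nvme, the partition /dev/nvme0n1p1, the SERVICES list and the MOUNTS_FILE override are placeholders that do not come from this thread. It handles a single partition rather than all of them, assumes the services are systemd units, and assumes that nothing needs doing when step 1 succeeds.

```shell
#!/bin/sh
# Sketch of the cron workaround. All names below are placeholders (assumptions),
# not values taken from the thread; adjust them to your setup.
MOUNT_POINT="${MOUNT_POINT:-/mnt/nvme}"
DEVICE="${DEVICE:-/dev/nvme0n1p1}"
SERVICES="${SERVICES:-}"                    # space-separated systemd units using the mount
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"  # overridable so the check can be tested

# Step 1: succeed only if mount point $1 is listed with the "rw" option.
mount_is_rw() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "rw") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$MOUNTS_FILE"
}

nvme_watchdog() {
    mount_is_rw "$MOUNT_POINT" && return 0              # step 1: healthy, nothing to do
    [ -b "$DEVICE" ] || { reboot; return 1; }           # step 2: device node is gone
    for s in $SERVICES; do systemctl stop "$s"; done    # step 3
    if mountpoint -q "$MOUNT_POINT"; then               # step 4
        umount "$MOUNT_POINT" || return 1
    fi
    fsck -y "$DEVICE"                                   # step 5
    mount "$DEVICE" "$MOUNT_POINT" || return 1          # step 6
    for s in $SERVICES; do systemctl start "$s"; done   # step 7
}
```

Calling nvme_watchdog at the end of the script and running it as root from cron every few minutes would approximate the procedure. Note that fsck -y answers yes to every repair prompt, which may not be what you want on a disk that is already misbehaving.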

Thanks for this, but it is still not a solution in the long run.
I wonder if the same thing would happen with eMMC.

Hey jack, even if I plug a QC charger into it, the same thing happens.
To reproduce it faster you can run: apt install stress && cd /nvme-path && { journalctl -kf & stress --cpu 8 --vm 8 --hdd 8; }

Hey @jack, do you have the specs for the maximum current allowed for the M.2 SSD anywhere? That is the only thing I can think of. I have different versions at home, both with different power requirements.

We will try to reproduce this issue. The design power current for the SSD on ROCK 3A is 5A. Check your SSD’s power consumption; usually it’s 3.3V 3A peak.

Hey @jack, thanks for the answer!
The one I have at home is a Crucial rated at 3.3V 2.5A, so I should still have plenty of power left.
I will try one of those kernel 5.x builds to check whether their stack has better luck.
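
As a rough sanity check of the figures quoted above, the ratings can be converted to watts with P = V × I. The helper name watts is made up for this illustration, and treating the 5A design figure as applying to the 3.3V rail is an assumption, since the post does not say which rail it refers to.

```shell
# watts VOLTS AMPS: print power in watts (P = V * I) with two decimals.
watts() {
    awk -v v="$1" -v a="$2" 'BEGIN { printf "%.2f\n", v * a }'
}

watts 3.3 2.5   # Crucial rating quoted above
watts 3.3 3     # typical peak quoted by jack
watts 3.3 5     # board design figure, assuming it refers to the 3.3V rail
```

This prints 8.25, 9.90 and 16.50, so on paper the drive's rated peak sits well below the assumed budget.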

Hey jack, good news: using Armbian (Armbian 22.05.1 Bullseye) with the following kernel:
Linux rock-3a 5.18.0-rk35xx #22.05.1 SMP PREEMPT Sat May 28 08:41:15 UTC 2022 aarch64 GNU/Linux
I’ve managed to run a stress test like the one I shared above for 30 minutes without the NVMe filesystem being remounted.
SMART complained, but so far it is way better:
Device: /dev/nvme0, number of Error Log entries increased from 476187 to 476188

Scratch that, the error came back after a while

Hi, @jaysonsantos

We have had a 3A with a 23W PoE HAT setup running for some days, and we can reproduce this issue now. It needs more investigation. I will post an update if we have new findings.

Hey @jack, thanks for the answer!
I noticed that on Armbian with kernel 5.18 it still happens, but less frequently.
Could this patch [1] be beneficial to backport into Armbian?
[1] https://lore.kernel.org/linux-arm-kernel/165459351568.925770.13686160465924068647.b4-ty@sntech.de/T/

I noticed that the Armbian mainline kernel doesn’t have this patch, but my recent work on 5.19-rc1 includes it: https://github.com/amazingfate/build/blob/rockchip64-5.19/patch/kernel/archive/rockchip64-5.19/rk356x-dts-pcie2x1.patch#L13
You can try my 5.19-rc1 kernel: https://drive.google.com/drive/folders/1y4fYI87xFvOrChLhlyevPcZxeP8gHBQM?usp=sharing

Did you locate the problem?

Hi, I have found something about this problem. Could you help me collect some data?
I need the output of these commands:

nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
smartctl -a /dev/nvme0

The nvme command is provided by the Ubuntu package nvme-cli.
The smartctl command is provided by the Ubuntu package smartmontools.

I also applied those ranges on 5.18 but still no luck. After 7 minutes I thought it was working, but after stopping stress with Ctrl-C, it failed.

nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
        Autonomous Power State Transition Enable (APSTE): Enabled
smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.18.4-rk35xx] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P2SSD8
Serial Number:                      2139E5D64ED1
Firmware Version:                   P2CR033
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 54e0000205
Local Time is:                      Sun Jun 19 20:24:24 2022 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     1.90W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     5000    1900
 4 -   0.0020W       -        -    4  4  4  4    13000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    620,612 [317 GB]
Data Units Written:                 3,991,592 [2.04 TB]
Host Read Commands:                 5,228,749
Host Write Commands:                31,533,130
Controller Busy Time:               317
Power Cycles:                       606
Power On Hours:                     1,743
Unsafe Shutdowns:                   398
Media and Data Integrity Errors:    471,412
Error Information Log Entries:      476,199
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     476199     0  0x3002  0x4005  0x028            0     0     -
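
To compare counters before and after a stress run without reading the whole report, the relevant line can be pulled out of the smartctl output. A minimal sketch: the function name smart_error_entries is hypothetical, and it assumes the line is labelled exactly as in the output above.

```shell
# Read `smartctl -a` output on stdin and print the value of
# "Error Information Log Entries" with thousands separators removed.
smart_error_entries() {
    sed -n 's/^Error Information Log Entries:[[:space:]]*//p' | tr -d ','
}
```

Running smartctl -a /dev/nvme0 | smart_error_entries before and after the stress test shows whether the counter moved.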

Thanks.
I have a solution now. Follow this link, change-kernel-params, to add the nvme_core.default_ps_max_latency_us=0 kernel parameter, which disables NVMe APST.
I think it should work.
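
After rebooting, it may be worth confirming that the parameter actually reached the kernel. A small sketch, assuming a standard Linux /proc/cmdline; the function name apst_latency_from_cmdline is made up here. The value the driver is using should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.

```shell
# Print the value of nvme_core.default_ps_max_latency_us from a kernel command
# line file (default: /proc/cmdline). Returns non-zero if the parameter is absent.
apst_latency_from_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' |
        grep .
}
```

Called without arguments it inspects the running system; a printed 0 means the setting was passed at boot.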

root@rock-3a:~# nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
        Autonomous Power State Transition Enable (APSTE): Enabled
root@rock-3a:~# smartctl -a /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.193-42-rockchip-ge29be2b2ed27] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON OM8PDP31024B-A01
Serial Number:                      50026B728286CF9A
Firmware Version:                   EDFK0S03
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 28286cf9a5
Local Time is:                      Mon Jun 20 15:29:21 2022 UTC
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.16W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0025W       -        -    4  4  4  4     5000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    6%
Data Units Read:                    1,751,074 [896 GB]
Data Units Written:                 1,761,510 [901 GB]
Host Read Commands:                 14,080,146
Host Write Commands:                13,157,894
Controller Busy Time:               151
Power Cycles:                       2,181
Power On Hours:                     205
Unsafe Shutdowns:                   1,936
Media and Data Integrity Errors:    0
Error Information Log Entries:      6,739
Warning  Comp. Temperature Time:    16751125
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   3
Thermal Temp. 2 Transition Count:   2
Thermal Temp. 1 Total Time:         17180229
Thermal Temp. 2 Total Time:         -1730788088

Error Information (NVMe Log 0x01, max 16 entries)
No Errors Logged

root@rock-3a:~#

This is what I get after 15 minutes with APST disabled:

nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
rock-3a-worker-1:lib:# time stress -i 10 --hdd 10

stress: info: [8477] dispatching hogs: 0 cpu, 10 io, 0 vm, 10 hdd
Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.809626] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 25690714, error -30)
Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.811031] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 25690714, error -30)

Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.812267] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 25690714, error -30)
stress: FAIL: [8489] (563) mkstemp failed: Input/output error
stress: FAIL: [8477] (394) <-- worker 8489 returned error 1
stress: WARN: [8477] (396) now reaping child worker processes
stress: FAIL: [8477] (451) failed run completed in 903s
stress -i 10 --hdd 10  1.83s user 1882.89s system 208% cpu 15:02.95 total
:1: Input/output error: sed

(eval):1: Input/output error: sed

(eval):1: Input/output error: sed

(eval):1: Input/output error: sed

zsh: Input/output error: /var/mail/root
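
Error -30 in the kernel messages above corresponds to EROFS (read-only file system) on Linux, which is consistent with the filesystem having been remounted read-only. To pull the affected device and error code out of a saved log, something like the following could be used. The function name ext4_error_codes is hypothetical, and the pattern only targets lines shaped like the ones above.

```shell
# Read kernel log lines on stdin and print each distinct "<device> <errno>"
# pair found in EXT4-fs messages that end with "error -N)".
ext4_error_codes() {
    sed -n 's/.*EXT4-fs (\([^)]*\)).*error \(-[0-9][0-9]*\)).*/\1 \2/p' | sort -u
}
```

For example, journalctl -k | ext4_error_codes would summarise which partitions reported such errors since boot.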

I could not reproduce this issue :joy::joy::joy: