aghost
June 19, 2022, 3:14pm
21
Hi, I have found something about this problem. Could you help me collect some data?
I need the output of these commands:
nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
smartctl -a /dev/nvme0
The nvme command is provided by the Ubuntu package nvme-cli.
The smartctl command is provided by the Ubuntu package smartmontools.
I also applied those ranges on 5.18, but still no luck. After 7 minutes I thought it was working, but after I stopped stress with Ctrl-C it failed.
nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
Autonomous Power State Transition Enable (APSTE): Enabled
smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.18.4-rk35xx] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: CT1000P2SSD8
Serial Number: 2139E5D64ED1
Firmware Version: P2CR033
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 54e0000205
Local Time is: Sun Jun 19 20:24:24 2022 CEST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     1.90W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     5000    1900
 4 -   0.0020W       -        -    4  4  4  4    13000  100000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 620,612 [317 GB]
Data Units Written: 3,991,592 [2.04 TB]
Host Read Commands: 5,228,749
Host Write Commands: 31,533,130
Controller Busy Time: 317
Power Cycles: 606
Power On Hours: 1,743
Unsafe Shutdowns: 398
Media and Data Integrity Errors: 471,412
Error Information Log Entries: 476,199
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     476199     0  0x3002  0x4005  0x028            0     0     -
aghost
June 20, 2022, 2:10am
24
Thanks.
I have a solution now. Follow this link, change-kernel-params, to add the kernel parameter nvme_core.default_ps_max_latency_us=0, which disables NVMe APST.
I think it should work.
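After rebooting with the parameter, it may help to confirm that it actually reached the kernel. This is a minimal sketch, assuming a POSIX shell; apst_latency_from_cmdline is a hypothetical helper name, and where the parameter has to be added depends on the image's bootloader configuration, which is not shown here:

```shell
# Print the value of nvme_core.default_ps_max_latency_us found in a kernel
# command line, or "unset" if the parameter is absent.
# $1 = file holding the command line (defaults to /proc/cmdline)
apst_latency_from_cmdline() {
    value=$(tr ' ' '\n' < "${1:-/proc/cmdline}" |
        sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' | tail -n 1)
    printf '%s\n' "${value:-unset}"
}
```

Called with no argument, the function reads /proc/cmdline of the running system. The value the loaded driver is using can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.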
root@rock-3a:~# nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
Autonomous Power State Transition Enable (APSTE): Enabled
root@rock-3a:~# smartctl -a /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [aarch64-linux-4.19.193-42-rockchip-ge29be2b2ed27] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: KINGSTON OM8PDP31024B-A01
Serial Number: 50026B728286CF9A
Firmware Version: EDFK0S03
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 28286cf9a5
Local Time is: Mon Jun 20 15:29:21 2022 UTC
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 95 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.16W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0025W       -        -    4  4  4  4     5000   45000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 6%
Data Units Read: 1,751,074 [896 GB]
Data Units Written: 1,761,510 [901 GB]
Host Read Commands: 14,080,146
Host Write Commands: 13,157,894
Controller Busy Time: 151
Power Cycles: 2,181
Power On Hours: 205
Unsafe Shutdowns: 1,936
Media and Data Integrity Errors: 0
Error Information Log Entries: 6,739
Warning Comp. Temperature Time: 16751125
Critical Comp. Temperature Time: 0
Thermal Temp. 1 Transition Count: 3
Thermal Temp. 2 Transition Count: 2
Thermal Temp. 1 Total Time: 17180229
Thermal Temp. 2 Total Time: -1730788088
Error Information (NVMe Log 0x01, max 16 entries)
No Errors Logged
root@rock-3a:~#
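The power-state tables in the smartctl outputs above show why nvme_core.default_ps_max_latency_us matters. As far as I understand the Linux driver's heuristic, a non-operational state (marked "-" in the Op column) is only used for APST when its exit latency (Ex_Lat, in microseconds) does not exceed the limit, and a limit of 0 turns APST off. The sketch below applies that simplified rule to saved smartctl output; apst_candidates is a hypothetical helper, and driver quirks are ignored:

```shell
# List non-operational power states whose exit latency fits under a limit.
# $1 = limit in microseconds
# $2 = file holding `smartctl -a` output (defaults to standard input)
apst_candidates() {
    awk -v limit="$1" '
        /^Supported Power States/ { in_table = 1; next }
        /^Supported LBA Sizes/    { in_table = 0 }
        # Data rows start with the state number; "-" in column 2 marks a
        # non-operational state, and the last column is Ex_Lat.
        in_table && $1 ~ /^[0-9]+$/ && $2 == "-" && limit > 0 && $NF + 0 <= limit + 0 {
            print "PS" $1
        }
    ' "${2:--}"
}
```

For example, smartctl -a /dev/nvme0 | apst_candidates 100000 lists the states that would qualify under a 100000 microsecond limit, while a limit of 0 lists nothing.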
This is what I get after 15 minutes with APST disabled:
nvme get-feature /dev/nvme0 -f 0x0c -H | grep APST
rock-3a-worker-1:lib:# time stress -i 10 --hdd 10
stress: info: [8477] dispatching hogs: 0 cpu, 10 io, 0 vm, 10 hdd
Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.809626] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss! (inode 25690714, error -30)
Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.811031] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss! (inode 25690714, error -30)
Message from syslogd@rock-3a-worker-1 at Jun 21 07:37:41 ...
kernel:[ 1044.812267] EXT4-fs (nvme0n1p1): failed to convert unwritten extents to written extents -- potential data loss! (inode 25690714, error -30)
stress: FAIL: [8489] (563) mkstemp failed: Input/output error
stress: FAIL: [8477] (394) <-- worker 8489 returned error 1
stress: WARN: [8477] (396) now reaping child worker processes
stress: FAIL: [8477] (451) failed run completed in 903s
stress -i 10 --hdd 10 1.83s user 1882.89s system 208% cpu 15:02.95 total
:1: Input/output error: sed
(eval):1: Input/output error: sed
(eval):1: Input/output error: sed
(eval):1: Input/output error: sed
zsh: Input/output error: /var/mail/root
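Error -30 in the EXT4 messages above is EROFS (read-only file system) on Linux, which suggests the filesystem had already been switched to read-only by the time stress failed. A minimal sketch for checking that after such a failure, assuming the usual /proc/mounts layout; mount_is_readonly is a hypothetical helper name:

```shell
# Succeed (exit status 0) if the given device is currently mounted read-only.
# $1 = device, e.g. /dev/nvme0n1p1
# $2 = mounts table to read (defaults to /proc/mounts)
mount_is_readonly() {
    awk -v dev="$1" '
        $1 == dev {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For example, mount_is_readonly /dev/nvme0n1p1 && echo "remounted read-only" prints the message only when the partition carries the ro option.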
aghost
June 21, 2022, 10:30pm
27
I could not reproduce this issue.
aghost
February 2, 2023, 8:43am
28
I found a possible solution to this problem.
The PCB layout of the PCIe lines on the 3A seems to have some impedance issues. I made an M.2 extension kit to reduce the effect, and it has now been running a k8s control plane and etcd for 2 days with no NVMe errors.
Here is a fix overlay:
/dts-v1/;
/plugin/;
/ {
    fragment@0 {
        /* "target" takes a phandle; "target-path" expects a string path */
        target = <&pcie2x1>;
        __overlay__ {
            /delete-property/ vpcie3v3-supply;
        };
    };
};
If you encounter PCIe link-down issues, you can try this.
Save it as rock3a-fix-pcie.dts and, on Armbian, install it with the command sudo armbian-add-overlay rock3a-fix-pcie.dts.
For other systems, here is a compiled dtbo:
rock3a-fix-pcie.zip (308 Bytes)
@James_Meece can you check if this overlay can help resolve your issue?
N_H
March 11, 2023, 10:49am
31
Hello,
I have the same problem with my PCI NVME. It shuts down after a few minutes of use when I write something onto it and then becomes unresponsive. Nothing works except for a reboot.
After I applied the fix using "cp rock3a-fix-pcie.dtbo /boot/dtbs/4.19.193-67-rockchip-g450948183988/rockchip/overlay", everything is cool now. At least, I can't seem to reproduce the error anymore.
Thank you all for the help.
As for your questions, it's hard to say whether the fix will be integrated into Ubuntu and Debian's OS. It would depend on whether the developers of those systems decide to include it in their updates.
Regarding using the fix for booting only with NVME, it might be possible, but I recommend consulting with an expert to make sure it won't cause any issues.
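The cp command quoted above hard-codes one kernel release. A small sketch that derives the same directory from the running kernel instead, assuming the /boot/dtbs/<release>/rockchip/overlay layout from that command also applies to your image; overlay_dir and install_overlay are hypothetical helper names, and whether the bootloader then loads the overlay automatically depends on the image:

```shell
# Print the overlay directory for a kernel release (defaults to the running one).
overlay_dir() {
    printf '/boot/dtbs/%s/rockchip/overlay\n' "${1:-$(uname -r)}"
}

# Copy a compiled overlay into that directory, refusing if it does not exist.
# $1 = path to the .dtbo file, $2 = optional kernel release
install_overlay() {
    dtbo="$1"
    dest="$(overlay_dir "${2:-}")"
    [ -d "$dest" ] || { echo "no such directory: $dest" >&2; return 1; }
    cp "$dtbo" "$dest/"
}
```

For example, sudo sh -c '. ./helpers.sh; install_overlay rock3a-fix-pcie.dtbo' would copy the file for the currently running kernel, where helpers.sh is an assumed file name holding the two functions.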
N_H
March 11, 2023, 1:32pm
32
Unfortunately, the error has occurred again after a day and a night. Access to the NVME is no longer possible, but after a reboot everything is okay again. Here is the output:
root@rock-3a:/etc/samba# smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [aarch64-linux-4.19.193-67-rockchip-g450948183988] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/nvme0 failed: Resource temporarily unavailable
root@rock-3a:
You may try this new overlay:
/dts-v1/;
/plugin/;
/ {
    fragment@0 {
        /* "target" takes a phandle; "target-path" expects a string path */
        target = <&pcie2x1>;
        __overlay__ {
            vpcie3v3-supply = <&vcc3v3_sys>;
        };
    };
};
Hey folks, I am using the same NVMe drive as before but with Armbian, and these errors don't happen. It may be worth trying to migrate.
N_H
March 12, 2023, 9:49am
35
Can you provide me with the new code as a finished file again (dtbo), please?
Or should we go straight to migration?
rock3a-fix-pcie.zip (354 Bytes)
Here is the new dtbo.
N_H
March 15, 2023, 12:02pm
37
Thank you for creating the file, and I have already uploaded it. Unfortunately, the test was not successful, and I am at a loss regarding the solution.
How can I add this dtbo to my Ubuntu build? Also, could this be the reason I am not able to boot from NVMe?
Rock 3A NVME boot - ROCK 3 Series - Radxa Forum
No, this dtbo is confirmed not working.
@amazingfate
It looks like the latest build of Armbian works a lot better with NVMe. I haven't actually seen any issues. Why is that?
I really need Ubuntu.