Rock 3A - NVME SAMSUNG MZVL21T0HCLR-00BL7 Issues

Hi Guys,

I have a SAMSUNG MZVL21T0HCLR-00BL7 (Samsung PM9A1 family) that will not work properly on a Rock 3A. The same drive works fine on an Intel Galileo (4.9 kernel).

I have tried many things:

1.) Kernel 4.19, Kernel 5.19
2.) /boot/uEnv.txt adding: extraargs=nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
3.) remove/rescan:
echo 1 > /sys/bus/pci/devices/0002:21:00.0/remove

echo 1 > /sys/bus/pci/rescan
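For repeated testing, the remove/rescan steps can be wrapped in a small helper. This is only a sketch: the 0002:21:00.0 address comes from the lspci output below, and the script must run as root to write to sysfs.

```shell
#!/bin/sh
# Remove the NVMe function from the PCI bus and trigger a rescan, so
# the kernel re-enumerates the device and reassigns its resources.
# The argument is the domain:bus:dev.fn address as shown by `lspci`.
rebind_nvme() {
    DEV="${1:-0002:21:00.0}"
    if [ ! -e "/sys/bus/pci/devices/$DEV" ]; then
        echo "no such PCI device: $DEV" >&2
        return 1
    fi
    echo 1 > "/sys/bus/pci/devices/$DEV/remove"
    sleep 1
    echo 1 > /sys/bus/pci/rescan
}
```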

The drive always shows up in “lspci -n”; here it is:

0002:21:00.0 0108: 144d:a80a

Class 0108 is NVMe, of course.

But it does not show up in “lsblk” or “nvme list” (nvme-cli).

Occasionally we get the drives to work, but then while using them, they disappear again.

uname -a
Linux tsarm204 4.19.193-42-rockchip-ge29be2b2ed27 #rockchip SMP Wed Apr 20 01:31:50 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04 LTS
Release: 20.04
Codename: focal

Here is the “dmesg | grep nvme” from 4.19:

[ 2.558172] nvme nvme0: pci function 0002:21:00.0
[ 2.558407] nvme 0002:21:00.0: enabling device (0000 -> 0002)
[ 4.385290] nvme nvme0: Shutdown timeout set to 10 seconds
[ 35.476782] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 35.514239] print_req_error: I/O error, dev nvme0n1, sector 2000409088
[ 35.529954] nvme 0002:21:00.0: enabling device (0000 -> 0002)
[ 35.530046] nvme nvme0: Removing after probe failure status: -19
[ 35.553311] Buffer I/O error on dev nvme0n1, logical block 250051136, async page read

We supply separate power to the drive, so we believe our power budget is fine. We have a separate regulator for the 3.3V power to the ARM board. It has about 5A output (more at peak surge), so about 16.5W.

Let me know any feedback or questions.

Thanks
James Meece

We believe the issue shows up clearly in “lspci -x”. On the Rock 3A with kernel 4.19, BAR0 (the four bytes at config offset 0x10) reads “04” followed by all zeros, so no memory space is allocated, which is why the drive never shows up fully in “nvme list”. But on the Intel Galileo with kernel 4.9, BAR0 does have an address: it reads “04 00 00 90”, and the 90 is the key here.

Details:

TSARM213 - issue drive - Kernel 4.19 - Rock 3A
lspci -x
0002:21:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a
00: 4d 14 0a a8 00 00 10 00 00 02 08 01 00 00 00 00
10: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00

TSGAL190 - issue drive - Kernel 4.9 - Intel Galileo
01:00.0 Class 0108: Device 144d:a80a
00: 4d 14 0a a8 06 04 10 00 00 02 08 01 00 00 00 00
10: 04 00 00 90 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00

So why is Base Address 0 all zeros?
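As a sanity check, the BAR0 bytes from the “10:” line of the dumps can be decoded by hand. A small sketch, using the working Galileo line from above:

```shell
# Decode BAR0 from the "10:" line of `lspci -x`. The four bytes at
# offset 0x10 are little-endian; the low bits are flags ("04" marks a
# 64-bit memory BAR) and the upper bits are the assigned base address.
line="10: 04 00 00 90 00 00 00 00 00 00 00 00 00 00 00 00"
set -- $line                       # $2..$5 are the BAR0 bytes, LSB first
bar0=$(( 0x$5 << 24 | 0x$4 << 16 | 0x$3 << 8 | 0x$2 ))
printf 'BAR0 raw = 0x%08x, base = 0x%08x\n' "$bar0" $(( bar0 & ~0xf ))
```

For the failing Rock 3A dump the same decode gives a base of 0x00000000, i.e. the BAR was never (re)programmed.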

Every now and then, I get the drive to work, and BAR0 has an address, like this:

0002:21:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
00: 4d 14 0a a8 06 04 10 00 00 02 08 01 00 00 00 00
10: 04 00 90 80 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8
30: 00 00 00 00 40 00 00 00 00 00 00 00 6b 01 00 00

So clearly something crashes, the drive gets removed from the bus, and it then no longer has a valid memory address.
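One more thing visible in the dumps: in the failing Rock 3A capture the COMMAND register (the two bytes at config offset 0x04) reads “00 00”, while the working captures show “06 04”. A hedged sketch of decoding that value (on a live system the register could be read with `setpci -s 0002:21:00.0 COMMAND`):

```shell
# COMMAND register (config offset 0x04), little-endian: "06 04" in the
# working dumps = 0x0406. Bit 1 enables memory-space decoding (needed
# for BAR0 to be usable) and bit 2 enables bus mastering (needed for DMA).
CMD=0x0406                                # value from the working capture
echo "mem-space=$(( (CMD >> 1) & 1 )) bus-master=$(( (CMD >> 2) & 1 ))"
# The failing capture has 0x0000 here: both bits clear, consistent with
# the kernel tearing the device down after the probe failure above.
```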

Perhaps the drive gives a vendor-unique response and the OS resets the PCI bus, which makes the drive disappear.

Using: nvme_core.default_ps_max_latency_us=1200 pcie_aspm=off

I have gotten the drive to do reads, writes, and DST (Drive Self-Test), but then it dies during a format command (nvme format /dev/nvme0n1 --namespace-id=0 --ses=1): it is no longer in lsblk, but remains in lspci (with no BAR0).

Let me know any feedback. I could also get on a call or Discord if someone would like.


When the drive disappears, it gives this response to “nvme list”:

NVMe status: ABORT_REQ: The command was aborted due to a Command Abort request(0x7)

Perhaps we have a hardware issue. We use a daughter card with a separate power supply; after removing it, things are OK so far. We will do more testing. Perhaps this was all a big red herring!


The thing is, hundreds of other drives work with this setup, and since the drive always appeared in lspci, I thought the hardware should be OK (but always check your assumptions!).

The guys are looking into checking impedance now.