PCIe link fails on one of two ROCK 4B+ boards

Hello
For some time I was sure that certain NVMe drives simply don’t work on certain board models, so if someone confirmed that a drive works on their device, it should also work on any other device of the same model. Of course, some software, and sometimes firmware, may need to be adjusted. But I found something strange on my boards.

I have two ROCK 4B+ boards (same revision, bought at the same time from the same shop). Both should have a small Kioxia NVMe drive plugged directly into the M.2 slot (sticking out like a big SD card). On one I installed the Radxa image, and on the second I had Armbian for tests; both run from the built-in eMMC. At some point I updated both systems and found that Armbian can’t find the NVMe: it was not listed in lsblk, and there were some PCIe link errors in dmesg. I was sure it had been there earlier, so maybe some update caused the problem. I tried flashing Armbian again, but that did not improve anything, so I decided to flash the Radxa image: same issue.
Then maybe the NVMe just died? I tested it on a PC - it’s OK. I updated its firmware to the latest - nothing changed.
OK, then I swapped it with the drive from the first board: no luck. Again, the NVMe works on the first board but not on the second. Then maybe the M.2 slot is broken? I swapped in a big 2280 module and that one worked, so it’s not the Kioxia or its firmware, not the slot, not the system and of course not power or cable (the same setup works on the other board).

What else can be different now? Of course I could try something other than this Kioxia, but the same setup works perfectly on the other board. Is there any other firmware or loader that can differ? Any ideas, @radxa?

Can you test some older images and see if this is a regression? Some dmesg output would also be helpful.

Sure, which particular image can you suggest?
I can also try copying the image from the working board to the failing one.

Here is dmesg for the working one: http://ix.io/4x5c
and for the failing one: http://ix.io/4x5d

(These are not the latest dmesg logs; both kernels are now updated to #f9d1b1529.)

The failing one has the following lines:

rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
rockchip-pcie f8000000.pcie: deferred probe failed
rockchip-pcie: probe of f8000000.pcie failed with error -110
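
For anyone scanning their own dmesg for the same failure, here is a minimal Python sketch that picks out failed probes and names the error code. Error -110 is ETIMEDOUT on Linux, which matches the link training timeout in the first line. The function name `probe_failures` and the small errno table are illustrative assumptions, not part of any Radxa tool.

```python
import re

# Linux errno values; only the one seen in this thread is listed (assumption:
# extend the table yourself if your log shows other codes).
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches kernel lines such as "probe of <device> failed with error -<code>".
PROBE_FAIL = re.compile(r"probe of (?P<dev>\S+) failed with error -(?P<code>\d+)")


def probe_failures(log_text):
    """Return (device, errno_number, errno_name) for each failed probe line."""
    results = []
    for line in log_text.splitlines():
        match = PROBE_FAIL.search(line)
        if match:
            code = int(match.group("code"))
            name = ERRNO_NAMES.get(code, "unknown")
            results.append((match.group("dev"), code, name))
    return results


# The three lines quoted above, used as sample input.
sample = """\
rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
rockchip-pcie f8000000.pcie: deferred probe failed
rockchip-pcie: probe of f8000000.pcie failed with error -110
"""

print(probe_failures(sample))
```

Running the same function over the dmesg of both boards is a quick way to confirm that only one of them reports the failed probe.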

Working kernel is 5.10.110-6-rockchip and failed kernel is 5.10.110-6-rockchip.

You can download the past kernels from here and see if an older kernel fixes the issue.

A wrong kernel was the first thing I suspected, but as I said, both boards are now running #f9d1b1529 (5.10.110-8-rockchip)
and the issue is still there: the NVMe is visible on rock1 and not visible on rock2.
Also, rock2 takes much longer to start (about 55 s vs 18 s?).
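
One way to see where the extra boot time goes is to look for the largest jumps between consecutive dmesg timestamps. The following is only a sketch: `largest_gaps` is a hypothetical helper, and the sample lines are invented placeholders, not taken from the logs linked above.

```python
import re

# dmesg lines normally start with a timestamp in seconds: "[   12.345678] text".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def largest_gaps(log_text, top=3):
    """Return up to `top` tuples (gap_seconds, time_before_gap, message_after_gap),
    sorted with the longest pause first."""
    entries = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))

    gaps = []
    for (t_prev, _), (t_next, msg) in zip(entries, entries[1:]):
        gaps.append((t_next - t_prev, t_prev, msg))

    gaps.sort(key=lambda item: item[0], reverse=True)
    return gaps[:top]


# Invented placeholder log, only to show the output shape.
placeholder_log = """\
[    1.000000] placeholder message A
[    1.500000] placeholder message B
[   12.500000] placeholder message C
"""

for gap, start, message in largest_gaps(placeholder_log, top=1):
    print(f"{gap:.1f} s pause after t={start:.1f} s, next message: {message}")
```

If the long pause on rock2 sits right before the PCIe timeout messages, that would suggest the slow start and the missing NVMe have the same cause.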

Of course I saw that error in the logs, but the question is why it happens on one device only. Is there anything else I could update?
BTW: can I check the PCIe link speed with rsetup? Earlier there was an overlay to enable 2.1 on ROCK 4. I checked it and cannot find anything to update on either board.
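
Independent of rsetup, the negotiated link speed can be read from sysfs on a running system. This is a minimal sketch that assumes the standard Linux PCI attributes `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` under `/sys/bus/pci/devices`; the function name `pcie_link_status` is made up for this example.

```python
from pathlib import Path

# Standard PCI sysfs attributes describing the negotiated and maximum link.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def pcie_link_status(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: {attribute: value}} for devices exposing link info."""
    status = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return status  # no PCI bus visible, e.g. the controller failed to probe

    for dev in sorted(root.iterdir()):
        attrs = {}
        for name in LINK_ATTRS:
            try:
                attrs[name] = (dev / name).read_text().strip()
            except OSError:
                continue  # attribute missing or unreadable for this device
        if attrs:
            status[dev.name] = attrs
    return status


if __name__ == "__main__":
    for address, attrs in pcie_link_status().items():
        print(address, attrs)
```

On a board where the PCIe controller probe fails, the result is expected to be empty, because no PCI devices get registered in the first place.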

Sorry, I meant the failed kernel is 5.10.110-8-rockchip. Can you try downgrading the kernel?

I forgot to import this one when we upgraded the kernel. I have added it to our overlay repo now. You can use rsetup to build and use it locally before we roll out the next kernel upgrade. However, I don’t think it is related to this specific issue.

Kernel 5.10.110-8-rockchip is OK on the other board with the same NVMe. Earlier I tried -6 as well as other builds, all with the same results. During the tests rsetup just updated the kernel to the latest, but that did not make any difference.

Today I tried to connect a UART to see the messages before the kernel starts, but I was not able to get any output from that board. I tried to update the bootloader with rkdevtools to be sure that both boards use the same thing, but the board stopped booting at all. Right now the green LED is on and the blue one is glowing soon after power-on. It should start from eMMC, and it’s not doing that.

Maskrom mode is still working: I can download the loader, upgrade it and upload an image to eMMC, but the board fails to start with the same effect every time. I also tried another power source, no luck. Right now, except for maskrom, it does not boot at all. I have already removed everything (NVMe, SD, RTC; only the cooling is left). I also tried erasing the eMMC in maskrom mode and burning an SD card, but the board still does not boot from anything. The recent bootloader (1.27) is quite new, but it’s the same thing with the older 1.20.

Any ideas what to check now? Has the board just died, and were the NVMe power/link problems only the beginning of a bigger issue?

Try booting from a microSD card without messing with maskrom and rkdeveloptool. If that doesn’t give any output on the serial, your board is likely faulty.

We provide a 1-year warranty on our products. You can contact the original seller for an RMA.

This was one of numerous tries to make this hardware useful


but sadly didn’t help, so it was eventually reverted.

Of course I tried that already, and there is no output on the UART. I always get at least two boards to be able to compare. This particular one always had some issues with the M.2 being unstable, but I did not expect it to fail so soon.
I will start a return with Allnet, but just before that: can you please explain what else I can try to check on that board? It enters maskrom mode, and I can read and write to eMMC and do all rkdev tasks. Is that glowing LED indicating some power issue? If the eMMC failed, then at least the SD card should start, but nothing on the UART is a bad sign.
Is there any chance that the UART is on different pins than on the first board?

We use the same debug console placement on almost all of our products.

Thanks for the link. It’s an interesting idea but also quite an ugly workaround. If multiple attempts help to establish the link, it may still be just unstable. Here I could easily compare two boards with the same PCIe cards, software etc. Something was different about the hardware, and eventually the board probably died :confused:

Sure, I just wanted to know what else I could check.

I have one Rock3 board that at some point could no longer talk to my UART console. It won’t start with the UART connected but works perfectly when it is not connected. Something was updated during an apt upgrade; I remember a message about that on the console shortly before I got this problem. That’s why I’m trying to find out what else can be done if the UART is not an option.

On ROCK 3, the reason is that we didn’t enable the serial console in the Linux kernel. However, you should still get output in U-Boot and, more critically, from the Rockchip blobs.

When you get nothing on the serial console, it’s pretty much guaranteed that the board is not starting. If the PSU and storage media are both good, then I can only conclude that the board itself has gone bad.

I just grabbed the latest ROCK 4B+ image built today for some other troubleshooting, and I tested it with an NVMe SSD. It is working here, so I think your issue is indeed hardware related.

My Rock3 board won’t start the boot process if the UART is connected; it doesn’t even try to start. Yet again I have two such boards, and yet again I can easily get UART output on the first one. I had to update the SPI flash with U-Boot and do the NVMe boot blindly, and that works. The UART world is just strange; I have a few adapters, to be sure that it’s not one particular adapter.

Yes, it works perfectly on my second Rock 4B+.
I did not expect the board to die so soon after the NVMe problems. What do you think failed? Power? The SoC itself?

Can a regular user try to diagnose anything on the board before returning it? This one is still under warranty, but I think it may be helpful for others to know how to check. For sure I’ll take a look with a thermal camera to see if anything acts strange.

A lot of defects are hard to spot with the naked eye, which is why the factory uses X-ray to inspect the products.