I/O errors: NVMe to SATA adapter doesn't work well?

aciceri · December 9, 2024, 2:43pm

I have a Rock5B (powered by a standard PC PSU) with two disks attached through this adapter: an SSD and an HD (Exos X14, 10 TB).

The SSD (formatted as ext4) has never given me a single problem. The HD, on the other hand, periodically stopped working and ended up in a state where even fsck didn’t help (luckily I make daily backups).
I thought it was due to bcachefs (the HD was formatted with that FS), so after having to reformat twice I decided to try a different FS (xfs, which should be much more stable). After a few days, though, I’m starting to see weird errors in dmesg again:

[40602.032058] ata4.00: exception Emask 0x10 SAct 0xffffc0 SErr 0x49d0000 action 0xe frozen
[40602.032778] ata4.00: irq_stat 0x00400000, PHY RDY changed
[40602.033250] ata4: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq DevExch }
[40602.033877] ata4.00: failed command: READ FPDMA QUEUED
[40602.034326] ata4.00: cmd 60/20:30:10:1b:7c/00:00:01:03:00/40 tag 6 ncq dma 16384 in
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[40602.035695] ata4.00: status: { DRDY }
[40602.036035] ata4.00: failed command: READ FPDMA QUEUED
[40602.036483] ata4.00: cmd 60/20:38:30:1b:7c/00:00:01:03:00/40 tag 7 ncq dma 16384 in
                        res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[40602.037849] ata4.00: status: { DRDY }
[40602.038172] ata4.00: failed command: READ FPDMA QUEUED
[40602.038620] ata4.00: cmd 60/20:40:50:1b:7c/00:00:01:03:00/40 tag 8 ncq dma 16384 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[40602.039991] ata4.00: status: { DRDY }
[40602.040314] ata4.00: failed command: READ FPDMA QUEUED
[40602.040762] ata4.00: cmd 60/20:48:70:1b:7c/00:00:01:03:00/40 tag 9 ncq dma 16384 in
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[40602.042127] ata4.00: status: { DRDY }

Or

[40619.286796] I/O error, dev sda, sector 12909812592 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[40619.287612] sd 3:0:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=18s
[40619.287615] sd 3:0:0:0: [sda] tag#10 Sense Key : 0x5 [current] 
[40619.287618] sd 3:0:0:0: [sda] tag#10 ASC=0x21 ASCQ=0x4 
[40619.287621] sd 3:0:0:0: [sda] tag#10 CDB: opcode=0x88 88 00 00 00 00 03 01 7c 1b 90 00 00 00 20 00 00
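The two excerpts can be tied together by decoding the "failed command" lines. The Python sketch below assumes libata prints the taskfile as cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and that for NCQ commands such as READ FPDMA QUEUED the sector count sits in the feature fields and the tag in the upper bits of nsect. The function name decode_ncq_taskfile is made up for illustration.

```python
import re

# Assumed libata layout of the "cmd" line:
# cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
_BYTE = r"([0-9a-f]{2})"
TF_RE = re.compile(
    r"cmd\s+" + _BYTE + "/" + ":".join([_BYTE] * 5)
    + "/" + ":".join([_BYTE] * 5) + "/" + _BYTE,
    re.IGNORECASE,
)


def decode_ncq_taskfile(line):
    """Decode LBA, sector count and tag from a libata NCQ 'cmd' line (hypothetical helper)."""
    match = TF_RE.search(line)
    if match is None:
        raise ValueError("no libata taskfile found in line")
    (cmd, feature, nsect, lbal, lbam, lbah,
     hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     _device) = (int(field, 16) for field in match.groups())
    # 48-bit LBA: the three "hob" (high order byte) fields are the upper half.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    # For NCQ commands the sector count is carried in the feature fields
    # and the tag in bits 7:3 of nsect.
    sectors = (hob_feature << 8) | feature
    tag = nsect >> 3
    return {"command": cmd, "lba": lba, "sectors": sectors, "tag": tag}


if __name__ == "__main__":
    line = ("[40602.040762] ata4.00: cmd 60/20:48:70:1b:7c/00:00:01:03:00/40 "
            "tag 9 ncq dma 16384 in")
    print(decode_ncq_taskfile(line))
```

For the tag 9 line this gives LBA 12909812592, 32 sectors (16384 bytes at 512 bytes per sector) and tag 9. That is the same sector reported in the I/O error, so both excerpts seem to describe the same failed reads.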

According to smartctl the disk is healthy and the SMART attributes look OK, but when I start the long self-test, the test itself fails. I’ve tried changing the cable but it doesn’t help. I’ve now attached the disk to another machine, where the long test is running without any errors in dmesg, so I’m starting to think it may be the controller’s fault (even though it works flawlessly with the SSD?). Otherwise it could be a power issue, but I can’t see how: I don’t remember the exact specs of the PSU, but I took it from a PC that certainly drew more power than the Rock5B + 1 SSD + 1 HD + 1 fan.

I’m tempted to buy another controller and see what happens, but perhaps someone can suggest a model that is known to work?
Otherwise it could be a software problem, but I have no idea how to investigate in that direction. I’m running NixOS (mainline Linux 6.11), if that helps.
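One rough way to narrow down hardware versus software is to decode the SErr value from the first dmesg excerpt (0x49d0000). The Python sketch below assumes the bit-to-name mapping that libata uses when it prints the SError line, and the SATA convention that bits 0-15 are the error field and bits 16 and up are the diagnostic (PHY and link layer) field. The names decode_serr and only_link_layer are made up for illustration.

```python
# Assumed mapping of SError bits to the short names libata prints.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serr(value):
    """Return the names of all known bits set in an SErr value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def only_link_layer(value):
    """True if only diagnostic bits (16 and up) are set, i.e. the error field is empty."""
    return value != 0 and (value & 0xFFFF) == 0


if __name__ == "__main__":
    serr = 0x49D0000
    print(decode_serr(serr))
    print(only_link_layer(serr))
```

For 0x49d0000 this reproduces the names in the kernel's own SError line (PHYRdyChg, CommWake, 10B8B, Dispar, LinkSeq, DevExch), and all of them sit in the diagnostic half. Errors of this kind are usually associated with the physical link (cabling, power, controller) rather than with the filesystem, although on their own they do not say which of those is at fault.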

foxx1337 · December 10, 2024, 10:42pm

The SATA controller overheats. Mine has an ASM1166 (as opposed to your JMB585) and I had to mount a Noctua NF-A4x10 to stop the signal drops. https://lastpixel.tv/tiny-home-server/

aciceri · December 11, 2024, 10:14am

Interesting, that may be the case (although the heatsink doesn’t feel hot to the touch right now; I should check again when it’s under a heavy load).
How did you attach the fan? That controller (and mine too) doesn’t seem to have holes for the screws.

foxx1337 · December 11, 2024, 7:24pm

There’s a picture in the article, with the duct tape.

aciceri · December 13, 2024, 9:08am

Oh, my bad, I just saw it; I was looking at the other controller.
However, if this is the case, why have I never had problems with the SSD? Honestly, the SSD probably does even more reading/writing, since the OS is on it.

aciceri · December 16, 2024, 3:59pm

In the end the problem seems to be the PSU: the 2 disks were connected in parallel on the same cable (with multiple connectors). After attaching the HD to the other “PSU line” (which was unused), I don’t see any errors anymore.