RAID Failure Quad SATA HAT

Shortly after receiving my Quad SATA HAT for the Raspberry Pi 4, I started having connection issues with my hard drives. The drives are brand new, and I have tested the SD cards, cables, PSU and all the other hardware. I updated the HAT with the JMS561 firmware and can see all 4 hard drives.

When I create my SnapRAID and mergerfs pool, the drives work for one day, and then they go missing and show errors. No one in the openmediavault community has been able to help me with this issue. I have tested everything on the hardware side.

In this state the SATA HAT is not usable.

Can someone please help me with this issue?

First of all, there are a few threads about the SATA HAT and SnapRAID on the RasPi. Did you read them?
Second, please provide logs of your system, such as dmesg, via pastebin.

I did read a lot of them and nothing solved my issue. I didn't know what else I could do, so I opened a topic.

I wondered how to share my logs because they are too big.

After reboot when everything is working fine.

After the drives go missing

I have checked every hard drive and they are brand new. There should be no bad sectors or anything to cause them to go missing. On my desktop they work fine.

Try reseating the drives… press harder when seating them too… I noticed that the connectors can be quite finicky… reseating always solves any missing drive issue for me…

The worst case scenario would be a bad JMS chip…

I checked them yesterday and they are all fully seated in the connectors. But later I will try to clean the connectors with alcohol, even though they look clean. Maybe it helps and there was some oxidation on the connectors.
It is just so weird: one day everything is fine, then the drives go missing. After this happens a few times the drives won't even connect on reboot anymore. They are just gone until I unplug the cables and plug them back in, then they appear on reboot again (I still have to mount them manually) and the cycle repeats itself.
In my openmediavault post, we went from looking at which drivers load at what time during boot to testing every piece of hardware one by one, but nothing seems to help. Maybe it really is the chip on the HAT.

Or maybe USB and the Raspberry Pi are just not the right tools for such a task?
Many people have this problem with this HAT (including me), while the Penta SATA HAT connected via the M.2 slot just works. As far as I remember, OMV does not support RAID over USB (not because they can't, but because it's problematic). The real answer is that USB is not a great choice for RAID, and mergerfs is suggested instead.

That's true, USB is not good for that, but it will still take a while until I switch to a board with SATA over PCIe, and it would be sad if I cannot get my NAS to work until I upgrade.
When I had OMV5 it was rock solid for a long time, and now it is doing this :frowning:

It will still work if you slightly change your setup.

  1. if your drives come back as a different device (like sda becoming sde), set up udev to name the device by its UUID, so the drive can be reconnected and then remounted at the same place
  2. set up something other than RAID - there are several ways to do that, like the mergerfs mentioned above
  3. pay extra attention to the USB connector - see what tkaiser mentioned: Quad SATA Hat Disconnects
  4. you can always do some disaster recovery - set up some monitoring and do restarts

I would not keep any important data there, but it may still be OK for a second backup or a media player.
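
As a rough illustration of points 1 and 4 combined, here is a minimal Python sketch of a check that could run from a timer or cron job. It assumes the standard Linux directory `/dev/disk/by-uuid`, which udev maintains; the UUIDs in the demo are made-up placeholders, and the function names are just for this example. What to do when a drive is missing (alert, remount, reboot) is left open.

```python
import os

BY_UUID = "/dev/disk/by-uuid"  # standard udev-maintained directory on Linux

def present_uuids(directory=BY_UUID):
    """List the filesystem UUIDs udev currently exposes (empty if the directory is absent)."""
    try:
        return os.listdir(directory)
    except FileNotFoundError:
        return []

def missing_uuids(expected, present):
    """Return the expected UUIDs that are not currently present, in the given order."""
    present_set = set(present)
    return [u for u in expected if u not in present_set]

# Hypothetical UUIDs for illustration only; replace with the values from `blkid`.
EXPECTED = [
    "11111111-aaaa-bbbb-cccc-000000000001",
    "11111111-aaaa-bbbb-cccc-000000000002",
]

gone = missing_uuids(EXPECTED, present_uuids())
if gone:
    print("missing drives:", ", ".join(gone))
else:
    print("all expected drives present")
```

Checking by UUID rather than by device name means the result stays the same even if a drive comes back as sde instead of sda.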

Thanks for the help and suggestions.

To 1.
This happened to me in the beginning, but the Radxa wiki for the Quad SATA HAT had a “fix” with hdparm for OMV 5, and since I switched to OMV6 it can detect each hard drive with its own serial number and UUID, so I don't know if I have to do the udev setup.

To 2.
Yes, maybe I should not use a RAID. The reason I wanted to use SnapRAID in the first place is that I wanted to be able to restore my data from a parity drive. Of course a RAID is never a backup (my backup is on external hard drives), but it is still nice to have that option if necessary, because it takes so long to transfer 10 TB of data. But maybe, as long as I keep this setup until I upgrade in a few months, I should use just mergerfs without SnapRAID. It would be a pain, though, to delete the RAID now and keep the data with just mergerfs, and then later create a RAID again on the new board and move the data over from the backup drives.

To 3.
Thanks, I will. I am using the “USB bridge” from allnetchina that came in the set, so I think there should not be any USB cable diameter issues?

To 4.
Yes, I will. I have reinstalled OMV and my whole setup about 15 times already, but it always gets stuck at this point with the HAT.

1: does the RAID reassemble itself after a failure? Maybe mdadm needs some tweaks to find the needed disks
2: USB is just not designed for RAID and is problematic. It's a known fact, and OMV doesn't recommend using it for RAID because of the constant problems with it
3: even with that rough connector there is still an issue that can cause such effects. The whole thing vibrates and may cause some electrical problems - from time to time, try pushing it firmly into the case - if you feel any movement, then it is slowly working its way back out
4: what I meant is that you can sometimes detect an issue, reboot, and repair. A RAID repair may take seconds or hours

I have some news.

I have now taken the hard drives out of the SnapRAID cluster and wiped the parity drive. The other 3 drives are in a mergerfs cluster. The former parity drive is mounted but not part of the mergerfs cluster.

The machine is now somewhat reliable. It is stable for around one day (as before), and after reboots or crashes it reliably reconnects all hard drives so they come back online. That is at least some good news. But after around one day the system still reboots or reloads because (I believe) a drive went missing. Every connector is pushed in to the limit, and nothing is loose in the case.

I am skimming through dmesg and journalctl -xe, but I am not enough of a pro to spot what is causing the problem or what makes OMV reboot. The missing drive mentioned above is just a guess; I don't know for sure.
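
For skimming large kernel logs like the dmesg output mentioned above, a small filter can cut things down to the lines that are more likely to matter. This is only a sketch: the keyword list is an assumption (substrings that often show up in kernel messages about USB resets and disk I/O errors), and the sample lines are invented for the demo, not taken from this system.

```python
import re

# Assumed keywords: adjust to whatever shows up in your own logs.
PATTERN = re.compile(r"usb|uas|I/O error|reset|disconnect|sd[a-z]\b", re.IGNORECASE)

def filter_log(lines, pattern=PATTERN):
    """Return only the lines that match the pattern, in their original order."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Invented sample lines, only to show what the filter keeps and drops.
SAMPLE = [
    "[   10.1] usb 2-1: USB disconnect, device number 3",
    "[   11.0] eth0: link up",
    "[   12.3] blk_update_request: I/O error, dev sdb, sector 2048",
]

for line in filter_log(SAMPLE):
    print(line)
```

To use it on real logs, one option is to save the output of `dmesg` to a file and pass the opened file to `filter_log`, since it accepts any iterable of lines. The filtered result is usually small enough to paste into a forum post.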

How do you power it? Maybe there is an issue there?
Also, high temperatures do not help.

I power it through the HAT with a 90W power supply. Temperatures are in the low 40s, so that should be no problem.

Which revision is yours? 1.1 uses USB-PD and 1.2 has a 12V barrel plug.
Temperature and power are always something to consider when talking about stability. I wonder how you managed to get 40°C? Mine at idle is 50.1°C now :confused:
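
For comparing readings like these, here is a minimal Python sketch for reading the Pi's SoC temperature. It assumes the usual sysfs file on Raspberry Pi OS, `/sys/class/thermal/thermal_zone0/temp`, which reports millidegrees Celsius; the function names are just for this example. Note that this is the SoC sensor only and says nothing about the drive temperatures.

```python
THERMAL_FILE = "/sys/class/thermal/thermal_zone0/temp"  # assumed path of the SoC sensor

def parse_millidegrees(raw):
    """Convert the sysfs text value (millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000

def read_soc_temp(path=THERMAL_FILE):
    """Return the SoC temperature in degrees Celsius, or None if it cannot be read."""
    try:
        with open(path) as f:
            return parse_millidegrees(f.read())
    except (OSError, ValueError):
        return None

print(parse_millidegrees("50100\n"))  # a raw value of 50100 means 50.1 degrees Celsius
print(read_soc_temp())                # None on machines without this sensor file
```

Logging this value every few minutes alongside the time of each drive dropout would show whether the two are related at all.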

Sorry for the late reply.
If so, then I have revision 1.2, because mine is the version with the 12V barrel plug.
I am running tests right now, and so far (1 day) it is running stable. I had a big problem with Portainer that took me a while and a lot of forum talk to solve. But now that Portainer is freshly set up, it somehow runs smoothly (so far).
Until I upgrade, I wanted to make my system stable again, so I removed SnapRAID and just use mergerfs. Hopefully nothing breaks.

I get such low temperatures because I have a small fan cooler on top of the Raspberry Pi, in between the HAT and the Pi (it is very small), but it only keeps the temps at 55°C under high load.
In addition to that, I built a small case for the Pi and the hard drives, with a Noctua fan on each side of the case for constant airflow (similar to a server). The fans run at what I believe is their minimum RPM (300) and are dead silent. I use a Noctua fan controller to power them and manually set the fan speed to a low RPM.
With that little bit of airflow the Pi and my hard drives run at cool temperatures.

For now the case door is only held shut with tape strips, because I have to open it whenever I need to get to the hardware.

OK, you added a big case with even bigger cooling, and that changes a lot for temperatures. The original case, with its rather weak and noisy cooling, gives about 50°C, and up to 70°C when in use. I'm looking at some changes to my design using turbine fans, which should increase airflow and gain a few degrees, but more importantly make less noise when in use.