Did I say the NanoPCs were in a junk drawer? The original heatsinks from them were tossed in a bin, but the devices themselves are humming away in a prior version of my 5-node cluster (proudly wearing the active heatsinks FriendlyElec released later).
This is all part of a journey to find the happy place of SBC-based clusters. It started with the rpi-4, but the poor storage options became a bottleneck. I then built out a 5-node NanoPC-T4 cluster, and its NVMe storage made things worlds better - but it still stumbles because the 4GB RAM capacity proves too limiting for me, mostly due to the large-ish memory footprint Prometheus requires. So it's still running, but not doing anything useful right now.
No pictures of this one because my 3d design skills were still forming and the result is too ugly to share.
Next stop was an rpi CM4-based cluster. I wanted to use the TuringPi-2, but their Kickstarter is taking (expletive)-forever to deliver. Even if they do manage to get something out this year, the compute modules are just flat-out unavailable right now.
I also did a 6-node CM4-based cluster using the DeskPi Super6c:
This one was better: 8GB of RAM each and an NVMe drive per node. It really works well, and Kube-Prometheus-Stack runs nicely (though whichever node Prometheus lands on is still under some memory pressure). Longhorn is a bit high-latency (not sure why, but disk latency averages about 4ms with frequent jitters >50ms). The bigger issue is the lack of access to each node for troubleshooting. If any node other than node 1 faults, the only available repair is to power-cycle the whole cluster; there is no way to reset them individually (ugh!). This defeats a lot of the resiliency value of clustering. It's complicated further by the fact that the nodes seem to randomly drop their NVMe drive. No idea why this happens - it could be something about the CM4, the Super6c, or rpi Linux - but it averages one faulted node per week, and that makes it hard to commit the cluster to "real work" you might actually count on.
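If you want to sanity-check where that latency is coming from, a quick fio run against the raw NVMe mount (outside of Longhorn) is a reasonable first step. This is just a sketch - the mount path and sizes below are placeholders, not exactly what I ran:

```sh
# 4k random writes, queue depth 1, direct I/O, 30 seconds
# /mnt/nvme/fio-test.bin is a placeholder - point it at the node's NVMe mount
fio --name=nvme-latency --filename=/mnt/nvme/fio-test.bin --size=256M \
    --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio --direct=1 \
    --runtime=30 --time_based --group_reporting
```

The completion-latency percentiles in the output are the interesting part: if the raw device is already jittery, Longhorn isn't the culprit.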
I really believe that this cluster with the Rock5s is hitting the sweet spot for me. So far it's rock-solid reliable. If I do need to work on any node, I can plug in HDMI and a keyboard from the front. For more aggressive resets I just drop power to the node, which I can even do remotely from the PoE switch. With 16GB of memory per node, the monitoring stack just fits nicely. Longhorn also performs much better, due, I think, to the combination of PCIe x4 and 2.5GbE networking.
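Since Prometheus memory keeps coming up, it's worth pinning its footprint down explicitly so the scheduler places it sensibly. Roughly, with kube-prometheus-stack that means setting the chart's prometheusSpec resources block - the numbers below are illustrative placeholders, not my exact config, and they assume the prometheus-community Helm repo is already added:

```sh
# Cap Prometheus memory so one pod can't squeeze a 16GB node
# (2Gi/3Gi and 15d retention are placeholder figures - tune to your scrape load)
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set prometheus.prometheusSpec.resources.limits.memory=3Gi
```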
I am really happy with this result. I may do some experiments with adding the NanoPCs and Super6c nodes in as additional compute in the future.