Rock 5B - many memory errors after a while

incognito · January 22, 2025, 11:21pm

Hello,

I’ve been running a Rock 5B as a server for some time (before this it was often collecting dust). I was booting off eMMC but I switched to an NVMe rootfs (Joshua Riek’s Ubuntu 24.04, latest 6.1 kernel from this distro).

I switched because some checksums, for example when building an Armbian image, were failing, and I thought it was due to the eMMC being corrupted or the command queue error (which I never attempted to mitigate).

However, after a while I started observing similar behaviour on the NVMe. Some database files became corrupt for no reason, etc.
I ran memtester and noticed many failures. This is the command, for example:

sudo memtester 8G 5 (8 GB, 5 passes). The errors appear instantly; here is an example:

FAILURE: possible bad address line at offset 0x00000000bbdc9200.
...
FAILURE: 0x0000000000000052 != 0x0000000000000008 at offset 0x0000000076b663f0.
FAILURE: 0x00000000000000f7 != 0x0000000000000008 at offset 0x0000000076b66a10.
FAILURE: 0x0000000000000048 != 0x0000000000000008 at offset 0x0000000076b66a20.
FAILURE: 0x00000000000000ba != 0x0000000000000008 at offset 0x0000000076b66a30.
FAILURE: 0x0000000000000091 != 0x0000000000000008 at offset 0x0000000076b66a70.
FAILURE: 0x00000000000000a2 != 0x0000000000000008 at offset 0x0000000076b66a90.
FAILURE: 0x00000000000000f8 != 0x0000000000000008 at offset 0x0000000076b66ab0.
FAILURE: 0x0000000000000048 != 0x0000000000000008 at offset 0x0000000076b66ac0.
FAILURE: 0x00000000000000ef != 0x0000000000000008 at offset 0x0000000076b66ad0.
FAILURE: 0x0000000000000052 != 0x0000000000000008 at offset 0x0000000076b66af0.
...
FAILURE: 0xffffffffffffffe3 != 0xffffffffffffffff at offset 0x0000000076b40ad0.
FAILURE: 0xffffffffffffffef != 0xffffffffffffffff at offset 0x0000000076b40ae0.
FAILURE: 0xffffffffffffff08 != 0xffffffffffffffff at offset 0x0000000076b40af0.
FAILURE: 0xffffffffffffffbf != 0xffffffffffffffff at offset 0x0000000076b40b00.
FAILURE: 0xffffffffffffff4f != 0xffffffffffffffff at offset 0x0000000076b40b10.
FAILURE: 0xffffffffffffff2c != 0xffffffffffffffff at offset 0x0000000076b40b30.
FAILURE: 0xffffffffffffffe0 != 0xffffffffffffffff at offset 0x0000000076b40b50.
FAILURE: 0xffffffffffffff55 != 0xffffffffffffffff at offset 0x0000000076b40b70.
FAILURE: 0xfffffffffffffffe != 0xffffffffffffffff at offset 0x0000000076b40b80.
FAILURE: 0xffffffffffffffc9 != 0xffffffffffffffff at offset 0x0000000076b40b90.
FAILURE: 0xffffffffffffffef != 0xffffffffffffffff at offset 0x0000000076b40ba0.
FAILURE: 0xffffffffffffffa2 != 0xffffffffffffffff at offset 0x0000000076b40bb0.
...
FAILURE: 0x0000000000000005 != 0x0000000000000001 at offset 0x0000000076b65a00.
FAILURE: 0x0000000000000091 != 0x0000000000000001 at offset 0x0000000076b65a10.
FAILURE: 0x00000000000000cd != 0x0000000000000001 at offset 0x0000000076b65a30.
FAILURE: 0x0000000000000062 != 0x0000000000000001 at offset 0x0000000076b65a50.
FAILURE: 0x0000000000000098 != 0x0000000000000001 at offset 0x0000000076b65a70.
FAILURE: 0x0000000000000051 != 0x0000000000000001 at offset 0x0000000076b65a90.
FAILURE: 0x00000000000000cd != 0x0000000000000001 at offset 0x0000000076b65ab0.
FAILURE: 0x000000000000000d != 0x0000000000000001 at offset 0x0000000076b65ad0.
FAILURE: 0x0000000000000041 != 0x0000000000000001 at offset 0x0000000076b65ae0.
FAILURE: 0x0000000000000019 != 0x0000000000000001 at offset 0x0000000076b65af0.
FAILURE: 0x0000000000000068 != 0x0000000000000001 at offset 0x0000000076b65b10.
FAILURE: 0x0000000000000011 != 0x0000000000000001 at offset 0x0000000076b65b20.
FAILURE: 0x0000000000000013 != 0x0000000000000001 at offset 0x0000000076b65b30.
FAILURE: 0x0000000000000051 != 0x0000000000000001 at offset 0x0000000076b65b50.
FAILURE: 0x0000000000000097 != 0x0000000000000001 at offset 0x0000000076b65b70.
FAILURE: 0x00000000000000fe != 0x0000000000000001 at offset 0x0000000076b65b90.

Basically every test ends in an error. The errors seem to appear more quickly after a reboot if I run the board with the default or performance set of governors (GPU, DMC, CPU) than when I use powersave (but eventually they also appear on powersave). I also tried to decrease the RAM speed but it did not help.

Is there a way to diagnose it better or maybe even fix it? I don’t have any important data on the board now.

incognito · January 29, 2025, 10:05am

Bump. I tried another power supply, the same thing happens. Hardware wizards @tkaiser @boogiepop, please help

willy · January 29, 2025, 5:38pm

The fact that it only affects the same 8 bits really makes me think of either a dead DRAM chip or a solder issue under a BGA chip. It could be useful to press on the DRAM chips during the tests to see if failures suddenly appear or disappear, and to slightly bend the board in one direction or the other to see if that changes anything. But I think that it’s a sign of a hardware defect in any case.

incognito · January 29, 2025, 6:34pm

Dang, thanks… Not sure how to do this as they are mostly under the fan. I will try.

How would you go about fixing this? I have no such skills. And how do you know that it affects the same 8 bits?

avaf · January 29, 2025, 6:41pm

Try adding mem=4G to your boot parameters and see what happens.

incognito · January 29, 2025, 7:18pm

Where are the boot parameters? I have a 16 GB board BTW. Runs on Joshua Riek’s Ubuntu currently.

avaf · January 29, 2025, 7:53pm

in my case:
/boot/extlinux/extlinux.conf

timeout 10
menu title select kernel
DEFAULT kernel-6.1

label kernel-6.1
    kernel /Image_6.1.43-rk3588-v4l2-cam
    initrd /initrd.img
    devicetreedir /dtbs/6.1.43-rk3588-v4l2-cam
    append earlyprintk console=ttyFIQ0,1500000n8 rw init=/sbin/init rootfstype=ext4 rootwait root=/dev/mmcblk1p2 net.ifnames=0 usbcore.autosuspend=-1 irqchip.gicv3_pseudo_nmi=0 coherent_pool=2M swiotlb=66560 mem=4G
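If you would rather script the change than edit the file by hand, the idea can be sketched as below. This is only an illustration under assumptions: `set_mem_limit` is a hypothetical helper name, and it assumes an extlinux.conf laid out like the one above, with kernel arguments on `append` lines. Other images may generate this file from a different config, in which case a manual edit there would be overwritten.

```python
def set_mem_limit(conf_text: str, limit: str) -> str:
    """Return conf_text with mem=<limit> set on every 'append' line.

    Hypothetical helper: any existing mem= argument is dropped first,
    so the limit is never specified twice.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("append "):
            indent = line[: len(line) - len(stripped)]
            args = [a for a in stripped.split()[1:] if not a.startswith("mem=")]
            args.append(f"mem={limit}")
            line = indent + "append " + " ".join(args)
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened example input; a real file has more lines and arguments.
example = "label kernel-6.1\n    append rootwait root=/dev/mmcblk1p2 mem=4G\n"
print(set_mem_limit(example, "8G"))
```

On a real board you would read /boot/extlinux/extlinux.conf, keep a backup copy, write the result back as root, and reboot.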

boogiepop · January 29, 2025, 7:51pm

it can be due to a lot of reasons.

the ddr parameters are trained by a blob that is executed on your initial boot. It is a part of the spi image provided by radxa. If there is an issue in this blob and your ddr is trained wrong, updating to the latest ddrbin might help.

https://github.com/rockchip-linux/rkbin/blob/master/doc/release/RK3588_EN.md

Yet this is a theory, nothing concrete.

i agree with willy, try to run the same test while slightly pushing on the ddr chips. This will reduce the temperature a bit, if heat is a factor, and might also help to get transient contact if there is something wrong with the solder balls.

Also, this little board is not zoned thermally. If you have excessive heat from somewhere, it might impact another component like the ddr. Such heat should never damage the solder balls, but running a chip under constant heat degrades the quality of the silicon. In your case it might be the ddr.

EDIT: another troubleshooting method could be to set the dmc governor to userspace, set the dram freq to each individual frequency, and run the tests separately to check if the error happens more at higher freqs.

willy · January 29, 2025, 10:49pm

At least do not push on it with a metallic tool (e.g. no screwdriver), as you have a high chance of breaking a corner of the chip if you push too hard, and it will not push evenly. If it’s not accessible enough, bending the board will achieve the same results anyway. Be gentle with it, no more than 1-2 mm.

It’s not really fixable without a hot air gun or an IR soldering station. I once saved my old laptop with a hot air gun but it lasted only two weeks, because solder joints are generally not well fixed. Usually what repair shops do is to remove the chips, clean everything, reball it and solder it again. Not worth the hassle for the price. And if it’s under warranty you shouldn’t wait too much.

As for how I know it’s the same 8 bits: it’s in the outputs, “0xXXXXXX != 0xXXXXYY”. Only the lowest 8 bits differ in the tests, and apparently any value appears, so there are not even any bits joined together (that could happen if a tiny ball of solder or wire gets stuck under a device between two pins, but that’s very hard to do with BGA). Also, the problem could just as well be under the CPU and not the DRAM.
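One way to check this on your own log is to XOR each reported value with the expected one and see which bits are set. The sketch below is a minimal, hypothetical helper: the name `differing_bits` and the regular expression are assumptions, written against the memtester lines quoted earlier in this thread, and are not part of memtester itself.

```python
import re

# Matches memtester lines such as:
#   FAILURE: 0x0000000000000052 != 0x0000000000000008 at offset 0x0000000076b663f0.
# "possible bad address line" messages carry no value pair and are skipped.
FAILURE_RE = re.compile(r"FAILURE: (0x[0-9a-fA-F]+) != (0x[0-9a-fA-F]+) at offset")


def differing_bits(log_text: str) -> int:
    """Return a mask with a 1 for every bit that differed in any FAILURE line."""
    mask = 0
    for got, expected in FAILURE_RE.findall(log_text):
        mask |= int(got, 16) ^ int(expected, 16)
    return mask


# Three lines copied from the first post, one from each pattern.
SAMPLE = """\
FAILURE: 0x0000000000000052 != 0x0000000000000008 at offset 0x0000000076b663f0.
FAILURE: 0xffffffffffffffe3 != 0xffffffffffffffff at offset 0x0000000076b40ad0.
FAILURE: 0x0000000000000005 != 0x0000000000000001 at offset 0x0000000076b65a00.
"""

mask = differing_bits(SAMPLE)
print(f"bits that flipped: {mask:#018x}")
print("confined to the low byte:", (mask & ~0xFF) == 0)
```

Feeding it the full memtester output instead of `SAMPLE` would show whether any bit above bit 7 ever flips.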

incognito · January 29, 2025, 11:38pm

Thanks @avaf, this did limit the memory to 4 GB but also made the board crash quite quickly (ssh unresponsive after a few sec, hard shutdown needed).

@boogiepop I did try limiting the ram frequency by echoing the lower allowed values to dmc_max_freq or something like this (no improvement).

@willy the board is 2 years old so out of warranty for sure. If it was bad RAM, I know there is a way to mask this on x86, not sure about arm. Will do the "touch test"…

willy · January 30, 2025, 4:08am

It’s not really a matter of arm vs x86 regarding how to mask bad RAM, it’s a matter of what is bad. What you can do on x86 is mask an area. Here it seems the whole RAM is bad on certain bits, which really looks like a desoldered chip or a fried chip. Since the same chip is mapped from the first to the last byte of RAM, you’re a bit out of luck when it comes to disabling an area.

To be honest, your description of “boots then crashes” etc. makes me think much more of fried chips than bad solder joints, because bad solder joints are generally more “on-vs-off” than “crashes all the time”.

There might be something possible however. Usually such SoCs support working with only 32-bit DRAM (single-chip). Maybe if you find a DDR SPL that configures the DDR controller to work on a single chip, it will be able to train only one chip and use that one only. You’ll divide the RAM in half and stop using the fried chip. I’m unsure if 32-bit DDR SPL files are available for this SoC however, and I can’t find anything obvious at https://github.com/rockchip-linux/rkbin/tree/master/bin/rk35 (though maybe it’s possible to rebuild them oneself).

Now if the board is out of warranty, maybe you can try more invasive tests, including finding someone with a hot air gun who can try to replace your chips, but I’m not sure it’s worth it. And if you discover it’s the SoC which is fried, you will have lost your money :-/

boogiepop · January 30, 2025, 1:11pm

you should set it through userspace/set_freq
i.e.:

[root@alarm dmc]# echo userspace > governor
[root@alarm dmc]# cat available_frequencies
528000000 1068000000 1560000000 2112000000
[root@alarm dmc]# echo 528000000 > userspace/set_freq
[root@alarm dmc]# cat cur_freq
528000000
[root@alarm dmc]# pwd
/sys/devices/platform/dmc/devfreq/dmc
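To repeat this for every frequency without typing each step, the procedure could be wrapped roughly as below. This is a sketch under assumptions: `sweep_dmc` and the `run_test` callback are hypothetical names, the sysfs path is the one shown in the transcript above, and it needs root on the real board.

```python
from pathlib import Path
from typing import Callable, Dict

# Path taken from the `pwd` output above; it may differ on other kernels.
DMC_DIR = Path("/sys/devices/platform/dmc/devfreq/dmc")


def sweep_dmc(run_test: Callable[[int], bool], dmc_dir: Path = DMC_DIR) -> Dict[int, bool]:
    """Pin the DRAM clock to each available frequency and run a test at each.

    run_test(freq_hz) is a caller-supplied function that returns True when
    the memory test passed at that frequency.
    """
    freqs = [int(f) for f in (dmc_dir / "available_frequencies").read_text().split()]
    # Same as `echo userspace > governor` in the transcript.
    (dmc_dir / "governor").write_text("userspace")
    results = {}
    for freq in freqs:
        # Same as `echo <freq> > userspace/set_freq`.
        (dmc_dir / "userspace" / "set_freq").write_text(str(freq))
        results[freq] = run_test(freq)
    return results
```

A possible `run_test` would launch memtester through `subprocess.run` and report whether its exit status was zero; comparing the results per frequency then shows whether the errors track the DRAM clock.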

incognito · January 30, 2025, 8:12pm

Sure, but is this not equivalent to, for example:
echo 1068000000 | sudo tee /sys/class/devfreq/dmc/max_freq
while keeping the dynamic governors? It won’t go higher at least.

For now, I limited the available memory to 8G using @avaf’s method and so far the errors don’t appear, even with all governors set to performance (w/ stress-ng running on 6 threads in the background). Maybe 1 RAM chip is dead? Does this setting make the board use 1 chip or half of both chips?

willy · January 30, 2025, 10:23pm

You’re still using all chips when you halve the memory. There’s a total of 64 data lines going to the RAM and they’re used in parallel (32 bits total per chip). So if it works at 8G, it eliminates any doubt about bad solder joints and just means that one of your chips is partially fried, but only in the area above 8G. Maybe you’ll figure out a reasonable limit (e.g. 12G or so).

incognito · January 30, 2025, 10:23pm

thanks, I think this is solved for now, hopefully the rest of the chip doesn’t die.

The board survived 10 memtest passes with no errors, I think it should behave now.

incognito · February 1, 2025, 4:48pm

20 memtests (3G) fine with the 8 GB limit, risking 12 GB now. 1 memtest went well.
15G, 14G failed, 13G going well so far. Losing 3 GB would not be the end of the world, but still annoying.
Testing!

top - 14:39:06 up 12 min,  2 users,  load average: 1.04, 1.10, 0.74                       
Tasks: 238 total,   2 running, 236 sleeping,   0 stopped,   0 zombie                      
%Cpu(s): 12.9 us,  2.0 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  12940.5 total,     92.0 free,  12797.3 used,    187.6 buff/cache               
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    143.1 avail Mem                    
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND               
3978 root      20   0   12.0g  12.0g   1132 R  92.3  95.0   6:03.35 memtester
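The hunt for the largest usable limit is a bisection between a size known to pass and a size known to fail. A minimal sketch follows, assuming the bad region sits entirely above some threshold so that every smaller limit passes; `find_max_good` and `passes` are hypothetical names, and on the real board each probe means a reboot with a new mem= value followed by a memtester run.

```python
from typing import Callable


def find_max_good(good_gb: int, bad_gb: int, passes: Callable[[int], bool]) -> int:
    """Return the largest limit (in GB) for which passes() is True.

    good_gb must already be known to pass and bad_gb known to fail.
    passes(limit_gb) stands for "boot with mem=<limit>G and run the tests".
    """
    while bad_gb - good_gb > 1:
        mid = (good_gb + bad_gb) // 2
        if passes(mid):
            good_gb = mid
        else:
            bad_gb = mid
    return good_gb


# Simulated results matching this thread so far: 13G and below pass, 14G and up fail.
print(find_max_good(8, 16, lambda gb: gb <= 13))
```

Starting from 8G good and 16G bad this needs three probes (12, 14, 13). A single clean memtester pass is weak evidence, so each probe is worth repeating several times before trusting it.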

Sdig · February 10, 2025, 12:54am

Check VDD_DDR_S0 after a cold boot and monitor the voltage. It must read a stable 850 mV. If it fluctuates, it means the inductor is damaged and the buck regulator (Buck 5) is not providing a stable voltage. With more current demand, the voltage can reach high levels, resulting in corrupt behaviour starting with the DRAM. I experienced this, and it was reproducible after replacing the inductor. I think the main reason it happens is a significant reverse current, which can occur when you remove the power supply while the board is operating, creating a transient voltage high enough to damage the inductor. A more resilient inductor could sustain these transient loads. I still need to dig in a bit more for the root cause.

incognito · February 10, 2025, 3:41pm

Thanks, but can you explain this to me like I’m 5? How do I check VDD_DDR_S0? The rest of your post was also barely understandable - I don’t have an electronics background.
Also, if this was the case, why would it work when reduced from 16 to 13 GB of RAM? I did 2 burn-in tests of 20 memtester runs, one with 6-core stress-ng load running in the background and so far there were 0 errors.

Sdig · February 10, 2025, 7:20pm

It’s a test to rule out some power supply issues that could result in unpredictable behavior. Yes, there can also be other potential problems. To check the voltage, you will need a multimeter, and you read across one of the caps, such as this one for the Rock 5B.

[image: buck5]