Radxa Rock 5b Hangs multiple times a week

And it should throttle if needed to stay stable.
I think this problem is somehow related to the Coral. It gets quite hot under heavy workloads, especially with the max package; as far as I remember, it throttles at about 115 °C. According to the specs, it needs cooling on both sides to move heat away from the IC as well as its NPU.

Basically there are two libs for it, std and max; the second one uses higher frequencies and throttles at a higher temperature. Also, if you have the dual Edge TPU, you don’t have that small heatsink.
If you used it in a passive case, it’s clear why it added heat, especially under high workloads when all resources are in use. If you also added an NVMe drive on the bottom, that’s another heat source; how much depends on the particular M.2 board.

I learned something about RAM on one of my (8th gen) NUCs. It was stable with 32GB of RAM, but adding another 32GB caused instability. I took it out of the rack and tested it on a table, and it was fine. So I needed to find out what was wrong. A quick memtest always passed, but when I left it running overnight it showed some errors. Faulty RAM? I swapped the modules and got the same result. Then I placed a big fan on top, and the overnight test passed. So it was stable when cold, but under load the lower RAM module flipped several bits after some time. Proper cooling solved the issue, and the same unit then worked for about 6 months with no problems.
If the board is warm, retest everything with a big fan blowing directly at it. If this solves the issues, then on long workloads the problem may be temperature-related. This is especially true for passive cases, many of which are designed for mostly idle usage. Of course, software and the kernel are something to keep in mind too. Good luck :slight_smile:
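To check whether a hang lines up with temperature, it can help to log the thermal zones while the workload runs. This is a minimal sketch, assuming the kernel exposes the standard Linux sysfs thermal interface (`/sys/class/thermal/thermal_zone*/temp`, reported in millidegrees Celsius). The function names `millideg_to_c` and `log_temps` are made up for illustration, and the zone names and count differ between boards and kernels.

```shell
# Convert a sysfs temperature reading (millidegrees Celsius) to degrees Celsius.
millideg_to_c() {
    awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 1000 }'
}

# Print one timestamped line per thermal zone every INTERVAL seconds (default 10).
# Runs until interrupted with Ctrl-C.
log_temps() {
    interval="${1:-10}"
    while :; do
        for zone in /sys/class/thermal/thermal_zone*; do
            # Skip entries that are missing or unreadable on this system.
            [ -r "$zone/temp" ] || continue
            printf '%s %s %s\n' "$(date +%T)" "$(cat "$zone/type")" \
                "$(millideg_to_c "$(cat "$zone/temp")")"
        done
        sleep "$interval"
    done
}
```

Running `log_temps 10 | tee temps.log` in a second terminal during the overnight test leaves a record of the last temperatures seen before a hang, with and without the fan.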

Hello Mr. Vadim! I just had a different stability issue with my 32GB Rock. I had to patch new rkbins into my boot image to get it running stably:
1: https://github.com/rockchip-linux/rkbin/blob/master/bin/rk35/rk3588_bl31_v1.45.elf
2: https://github.com/rockchip-linux/rkbin/blob/master/bin/rk35/rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.16.bin

With the help of inindev’s great u-boot repo:

(just wget the files into rkbin dir and change makefile accordingly)
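One thing to watch when fetching these with wget: a GitHub `blob` link points at an HTML page, not at the file itself. The sketch below rewrites such a link into the corresponding raw-file URL. The function name `blob_to_raw` is hypothetical, and it assumes the usual `https://github.com/OWNER/REPO/blob/BRANCH/PATH` layout.

```shell
# Turn a GitHub "blob" page URL into the matching raw-file URL.
# URLs that do not match the expected layout are printed unchanged.
blob_to_raw() {
    printf '%s\n' "$1" |
        sed 's#^https://github.com/\([^/]*\)/\([^/]*\)/blob/#https://raw.githubusercontent.com/\1/\2/#'
}
```

For example, `wget "$(blob_to_raw "$url")"` downloads the binary rather than the web page, where `$url` holds one of the two links above.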
I also use inindev’s image. I don’t know if it’s compatible or not, but maybe you can adapt it to your image :smiley:

Before, I couldn’t allocate any larger amount of RAM, but now everything runs like butter. (I switched my main server to the 32 GB board and everything went nuts.)
Happy Easter, maybe it helps.

Edit: and if you try it, I guess it’s best to disconnect power once on reboot. Maybe it’s running slower without your Coral and is therefore more stable. Who knows :smiley:

Thanks for your comments and suggestions, everyone. I wanted to give you an update: I haven’t seen stability issues since changing the power supply, so for now I’ll refrain from using compiled binaries from a random third-party repo. If the issue returns, I’ll look into inindev’s work to try to understand whether it’s safe to use his binaries.

As an experiment, I installed the coral TPU back on the board, but am not using it for inference. Just to test if the accelerator card wasn’t drawing too much current. After a few weeks I’ll start using it for inference, too - to continue testing. Frankly, I prefer the Rock NPU - it runs much cooler and the YOLOv5 model gives much better results than any of the coral.ai models - so once I’m done testing I’ll keep using the NPU… Will keep you posted.

Maybe you can provide a link on how to compile u-boot according to this repo?
I’m not that deep into the topic of compiling u-boot on my own.

Hi There,

You just have to download the non-random, non-third-party Rockchip binaries to your computer and use the script from inindev to download and compile u-boot, which makes your life easier. So something like this:

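apt install git
git clone --depth 1 "$REPOURL"    # $REPOURL is a placeholder for the repo URL
cd uboot-rockchip/rkbin
wget file1                        # file1 and file2: the two rkbin files linked above
wget file2
nano …/Makefile                   # change the path for the two files
make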

The script will tell you exactly what to do to flash the images and if you need to install more packages.

Oh, and if it asks for a git username, that’s because it’s patching locally. Just insert the example text 1:1 (“Your Username” etc.). It won’t upload anything. But check the script before using it, as with everything, if possible :smiley:

Edit: here are some snippets:
In Makefile

RK3588_ATF := …/rkbin/rk3588_bl31_v1.45.elf
RK3588_TPL := …/rkbin/rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.16.bin

TARGETS := target_rock-5b

I had to install these:

apt -y install screen bc libssl-dev python3-pyelftools python3-setuptools swig

git config --global user.email "you@example.com"
git config --global user.name "Your Name"

mv rock-5b_idbloader.img idbloader.img
mv rock-5b_u-boot.itb u-boot.itb

Good luck!

Small update: It’s also working for Joshua Riek’s Ubuntu. It got rid of the SIGSEGV error in Chrome when watching video (it has now been working for a while without errors). Best.
Edit: Today I got the error again when I opened a new tab and had to restart to get it working again… I think it’s still more stable than before.

After a period of stability, unfortunately the issue happened again, so I can rule out the power brick… I suspect it’s a hardware issue.

Hi, @vadim

Can you try the latest B39 ROCK 5B build image? From the log, it seems to be a cpu or memory issue.

https://github.com/radxa-build/rock-5b/releases/download/b39/rock-5b_debian_bullseye_kde_b39.img.xz

Hi Jack, I’m currently using Ubuntu Focal with the latest updates from radxa repositories (kernel version 5.10.110-37-rockchip-g74457be0716d). Is the debian image superior in terms of its stability? I could try to reinstall but it’s not going to be a minor effort, hence trying to understand what we’re trying to achieve.

Update: Looking at the releases - I could do a release-upgrade to Jammy. Would that be acceptable?

Is the debian image superior in terms of its stability?

Yes. Debian image releases need to pass our QA tests.

I could try to reinstall but it’s not going to be a minor effort

You can use another SD Card to install an image and test the stability.

Update: Looking at the releases - I could do a release-upgrade to Jammy. Would that be acceptable?

This is not recommended. It will break some Rockchip binary packages.

Okay, in this case I’ll move to Debian completely and give it a try. I want to run the same workload as on Ubuntu, to remove any kind of ambiguity from the testing process.

hi @vadim,
maybe take a look at “Freezing Rock 5B with Rsync transfer”,
the issue I’m facing.

Hi @jack, after months of trial and error to get the problem resolved, killing an NVMe drive in the process, I finally managed to get the latest image running.

While I would need more time to confirm whether the issue was resolved, I noticed that there are other kinds of problems with that image. For instance - updating the kernel to latest (6.1.43-19-rk2312) from the current (6.1.43-15-rk2312) breaks ethernet connectivity. You mentioned that debian OS goes through QA - is this a known issue? I reported it here just in case.

Last follow-up on the subject to close the thread. I haven’t had a single restart in the last three weeks, and consider the issue resolved. Remediation: replaced the NVMe drive (despite wear being well below the stated lifetime). Changing between multiple OS and kernel versions did not affect the stability.

Was it a Samsung 970 Evo by any chance? They seem to be too power-hungry for the Rock 5B.

Nope, it was a 1TB Kioxia EXCERIA. I replaced it with a Crucial P3 Plus and the problem went away. Before that, I tried removing all devices from the board, including unpowered USB and the M.2 TPU, and replaced the power supply with a 65 W one with Power Delivery (suspecting power issues) - nothing helped. Interestingly enough, it worked just fine for a year or two before the issues started. Maybe it’s somehow related to component wear.
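For anyone wanting to look at their own drive before swapping it, smartctl prints a few NVMe health counters. The sketch below pulls out four of them from `smartctl -a` output. The function name `nvme_health_summary` is made up, and it assumes the field labels that smartctl normally prints for NVMe devices. In this thread the failing drive showed wear well below its stated lifetime, so a clean summary does not rule the drive out.

```shell
# Extract a few NVMe health fields from `smartctl -a` output read on stdin.
# Fields that are absent from the input are simply not printed.
nvme_health_summary() {
    awk -F': *' '
        /^Critical Warning:/                { print "critical_warning=" $2 }
        /^Percentage Used:/                 { print "percentage_used=" $2 }
        /^Media and Data Integrity Errors:/ { print "media_errors=" $2 }
        /^Error Information Log Entries:/   { print "error_log_entries=" $2 }
    '
}
```

Typical use would be `sudo smartctl -a /dev/nvme0 | nvme_health_summary`, where `/dev/nvme0` is an assumed device name.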

I also have instability issues. I think they started when I upgraded Armbian bookworm to the latest:
v24.8.4 for Rock 5B running Armbian Linux 6.11.7-edge-rockchip-rk3588

I also tried several PSUs: 9V 3A -> 12V 2A -> 12V 2.5A
I think it was more stable with 12V 2A. I could run several stress tests without problems. Then after 2 days it failed again. The 12V 2.5A supply only lasted for 30 minutes, so I guess it depends on the kind of load.

I’m curious about @incognito’s experience, as I bought this board with a 1TB Evo 980, which is very fast, particularly with Armbian. When I measure power usage with a reasonably accurate power socket, the board uses 5W when idle and about 9W when running stress tests. The socket is too slow to measure spikes, but a 12V * 2.5A => 30W power supply seems to offer sufficient capacity. I’m not totally convinced, though.
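The supply arithmetic above is just volts times amps, compared against the measured draw. A small sketch for the three supplies mentioned, with hypothetical helper names `supply_watts` and `headroom_watts`. It works on rated and averaged figures only, so it says nothing about short spikes that a slow power socket cannot see.

```shell
# Rated supply power in watts from voltage (V) and current (A).
supply_watts() {
    awk -v v="$1" -v a="$2" 'BEGIN { printf "%.1f\n", v * a }'
}

# Headroom in watts between the rated supply power and a measured draw (W).
headroom_watts() {
    awk -v v="$1" -v a="$2" -v w="$3" 'BEGIN { printf "%.1f\n", v * a - w }'
}
```

For example, `headroom_watts 12 2.5 9` compares the 12V 2.5A supply with the 9W stress-test reading.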

@vadim although the topic seems to be solved for you, could you leave it open?

I am using a 12V3A dumb barrel power supply.

For a long time I just used eMMC as the boot drive and an Intel optane as a small ssd, but then I started having some trouble with eMMC not reading data correctly (probably this can be fixed by disabling CQE but I never figured out how). The system was stable.
Then I moved the system to a DRAM-less OEM Biwin SSD. These are its power states as listed by smartctl:

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

I sometimes have instability issues but they don’t seem to be caused by the SSD but by the hackjob Rockchip kernel. There were no reboots due to power (when the CPU is power starved by the power supply, the board just reboots).
I can test power usage later.

I also have a type-C PD 12V 2.5A power supply and the board runs stably with it

Here are my measurements, running stress-ng on a number of cores. Strangely, the power consumption seems much higher than yours. I am measuring using a Shelly Plug S with some other stuff connected too, so there is some background consumption, but it should not matter.

Baseline: 43.5 W

                     Plug (W)   Rock 5B (W)   Step adds (W)
Power supply only:   44.1       -             -
Idle:                51.3       7.2           -
1 core:              52.8       8.7           1.5
2 core:              54.2       10.1          1.4
3 core:              55.4       11.3          1.2
4 core:              56.6       12.5          1.2
8 core:              58.2       14.1          1.6
Idle again:          51.3       7.2           -
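The Rock 5B figures in the table are the plug reading minus the 44.1 W measured with only the power supply connected, and each step figure is the difference from the previous row. A short awk sketch of that subtraction, for anyone repeating the measurement. The function name `board_power` and the "label reading" input format are assumptions made for illustration.

```shell
# Read "label reading_in_watts" pairs on stdin, one per line.
# $1 is the reading with only the power supply connected.
# Prints: label, board power, and (from the second row on) the change per step.
board_power() {
    awk -v base="$1" '
        { board = $2 - base }
        NR == 1 { printf "%s %.1f\n", $1, board }
        NR > 1  { printf "%s %.1f %+.1f\n", $1, board, board - prev }
        { prev = board }
    '
}
```

Feeding it lines such as `idle 51.3` and `1core 52.8` with a base of 44.1 reproduces the table values. The load itself can be generated with something like `stress-ng --cpu 4 --timeout 60s`.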

Thanks for sharing. I don’t know why your consumption is higher. The distro, maybe. Meanwhile, I’m starting to think that running searxng together with gluetune as docker compose services is at least triggering the issue. I seriously stressed the board (memory, file, CPU and gluetune, all at the same time): no problem! I’m now waiting, without running that docker compose combo, to see whether it still crashes. If the combo is the trigger, then I think it might be driver-related. I remember a warning during the last upgrade that said something like “Missing certain network files”. I never got the warning again, but maybe it’s an indication that something is not well supported. So I’ll wait and see whether I can find some kind of pattern.