Radxa Rock 5b Hangs multiple times a week

Update: after moving the workloads to the built-in NPU, the overall temperatures decreased, but the issues persisted. I am giving it a last try with another power supply. I moved it to a dedicated 65W GaN2 power supply, let's see if it helps.

Was the Coral that hot? Have you used the basic package or the max version?

Not sure what you mean by max version or basic package. I’ll try to give more detail below:

The card is a dual-TPU Coral M.2 card; however, only one of the TPUs is detected by the OS, probably because only one lane is available. I tried running it in this metal case, and the card would throttle as the temperature exceeded 80-something degrees Celsius after a few seconds of operation. I added a heat sink to the card and removed the plastic front and back panels to allow for air circulation. The card would then operate at around 72-75C. Better, but not ideal. I bought a 120mm fan and added it to the setup, blowing at an angle at the whole case (cooling the RK3588 SoC) but also reaching inside to cool the TPU's smaller heat sink. The fan would turn on when the SoC reported 60C or above, and turn off when the temperature dropped to 40C.
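The on/off behaviour described above is a two-threshold (hysteresis) switch. A minimal sketch of that logic follows, assuming only the 60C and 40C thresholds from the post; the class name `FanHysteresis` and the sample readings are made up for illustration, and reading the sensor and driving the fan are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class FanHysteresis:
    """Two-threshold fan switch: on at or above on_c, off at or below off_c."""
    on_c: float = 60.0    # switch-on threshold taken from the post
    off_c: float = 40.0   # switch-off threshold taken from the post
    running: bool = False

    def update(self, soc_temp_c: float) -> bool:
        """Feed one SoC temperature reading and return the new fan state."""
        if soc_temp_c >= self.on_c:
            self.running = True
        elif soc_temp_c <= self.off_c:
            self.running = False
        # Between the two thresholds the previous state is kept.
        return self.running


fan = FanHysteresis()
# Hypothetical sequence of readings in degrees Celsius.
states = [fan.update(t) for t in (55, 60, 50, 41, 40, 45)]
print(states)
```

The gap between the two thresholds is what keeps the fan from rapidly toggling when the temperature hovers around a single set point.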

After removing the TPU, and using the built-in NPU (much better results on CodeOwners.AI by the way), the SoC temperature never went above 59C (mostly running at around 55-56C), even though I’m continuously running object detection (multiple times a second) from multiple cameras around my home.

Also, FYI, after changing the power supply I haven’t seen it hang, but it might be just a coincidence - I’ll keep monitoring it.

From our tests, the Radxa Metal Case is enough for heat dissipation at full load at room temperature. The surface of the metal case gets hot (~60C), but the ROCK 5B will not hang or reboot.

And it should throttle if needed to stay stable.
I think this problem was related to the Coral somehow. It gets quite hot under heavy workloads, especially with the max package; as far as I remember, it throttles at about 115C. According to the specs, it needs cooling from both sides to move heat away from the IC as well as its NPU.

Basically there are two libs for it, std and max; the second one uses higher frequencies and throttles at a higher temperature. Also, if you have the dual Edge TPU, you don't have that small heatsink.
If you used that in a passive case, then it's clear why it added heat, especially under high workloads when all resources are used. If you also added an NVMe drive on the bottom, that's another source of heat; how much depends on the particular M.2 board.

I learned something about RAM on one of my (8th gen) NUCs: it was stable with 32GB RAM, but adding another 32GB caused instability. When I took it out of the rack and tested it on a table, it was OK. So I needed to find out what was wrong. A quick memtest always passed, but when I left it running overnight it showed some errors. Faulty RAM? I swapped the modules and got the same result. Then I placed a big fan on top of it, and the overnight test passed. So it was stable when cold, but under load the lower RAM module would flip several bits after some time. Proper cooling solved the issue, and the same unit has worked for about 6 months with no issues.
If the board is warm, retest everything with a big fan blowing directly at it. If this solves the issues, then something is probably temperature-related under long workloads. This is especially true for passive cases, many of which are designed for mostly idle usage. Of course, software and the kernel are something to keep in mind too. Good luck :slight_smile:
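To check whether hangs line up with heat, it can help to record temperatures while the long workload runs. The sketch below reads the Linux sysfs thermal zones, which report millidegrees Celsius in `thermal_zone*/temp`; the function name `read_thermal_zones` is made up here, and which zones exist on a given board is an assumption you would need to verify.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"<zone> (<type>)": temperature in C} for each readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone files missing or unreadable; skip this zone
        # sysfs reports millidegrees Celsius, so divide by 1000.
        readings[f"{zone.name} ({zone_type})"] = millideg / 1000.0
    return readings
```

Calling this once a minute and appending the result to a file with a timestamp gives a log you can compare against the time of the last entry before a hang.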

Hello Mr. Vadim! I just had a different stability issue with my 32GB Rock. I had to patch new rkbins into my boot image to make it stable:
1: https://github.com/rockchip-linux/rkbin/blob/master/bin/rk35/rk3588_bl31_v1.45.elf
2: https://github.com/rockchip-linux/rkbin/blob/master/bin/rk35/rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.16.bin

With the help of inindev's great u-boot repo:

(just wget the files into the rkbin dir and change the Makefile accordingly)
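One thing to watch when fetching those two files: the links above are GitHub `/blob/` pages, and downloading such a URL directly normally saves the HTML page rather than the binary. A small sketch of converting a blob URL to its `raw.githubusercontent.com` form follows; the helper name `github_blob_to_raw` is made up, and it assumes the branch or tag name contains no slash.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com blob URL into the matching raw-file URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *path])


print(github_blob_to_raw(
    "https://github.com/rockchip-linux/rkbin/blob/master/bin/rk35/rk3588_bl31_v1.45.elf"
))
```

Whether these exact file versions are still present on the `master` branch is not something this sketch checks.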
I also use inindev's image. I don't know if it's compatible or not, but maybe you can adapt it to your image :smiley:

Before, I couldn't allocate any larger amount of RAM, but now everything runs like butter. (I changed my main server to the 32 GB one and everything went nuts.)
Happy Easter, maybe it helps.

Edit: and if you try it, I guess it's best to disconnect power once when rebooting. Maybe it's running slower without your Coral and is therefore more stable. Who knows :smiley:

Thanks for your comments and suggestions, everyone. I wanted to give you an update: I haven't seen stability issues since changing the power supply, so for now I'll refrain from using compiled binaries from a random third-party repo. If the issue returns, I'll look into inindev's work to try to understand whether it's safe to use his binaries.

As an experiment, I installed the Coral TPU back on the board, but I'm not using it for inference, just to test whether the accelerator card was drawing too much current. After a few weeks I'll start using it for inference, too, to continue testing. Frankly, I prefer the Rock NPU: it runs much cooler and the YOLOv5 model gives much better results than any of the coral.ai models, so once I'm done testing I'll keep using the NPU. Will keep you posted.

Maybe you can provide a link on how to compile u-boot according to this repo?
I'm not that deep into the topic of compiling u-boot on my own.

Hi there,

You just have to download the non-random, non-third-party Rockchip binaries to your computer and use the script from inindev to download and compile u-boot, which makes your life easier. So something like this:

apt install git
git clone --depth 1 $REPOURL
cd uboot-rockchip/rkbin
wget file1
wget file2
nano …/Makefile   # change the paths for the two files
make

The script will tell you exactly what to do to flash the images and if you need to install more packages.

Oh, and if it asks for a git username, that's because it applies patches locally. Just enter the example text ("Your Username" etc.) 1:1. It won't upload anything. But check the script before using it, as with everything, if possible :smiley:

Edit: here are some snippets:
In Makefile

RK3588_ATF := …/rkbin/rk3588_bl31_v1.45.elf
RK3588_TPL := …/rkbin/rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.16.bin

TARGETS := target_rock-5b
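If you would rather script the Makefile change than edit it by hand, the following is a rough sketch. It assumes the Makefile assigns these variables with plain `NAME := value` lines as in the snippet above; the function name `set_make_vars`, the `old/...` values, and the `rkbin/...` paths are placeholders for illustration, not the real paths in the repo.

```python
import re


def set_make_vars(makefile_text: str, values: dict) -> str:
    """Replace the right-hand side of `NAME := value` lines for the given names."""
    if not values:
        return makefile_text
    names = "|".join(re.escape(name) for name in values)
    pattern = re.compile(rf"^({names})\s*:=.*$", re.MULTILINE)
    # Rewrite each matching line, keeping the variable name and changing the value.
    return pattern.sub(lambda m: f"{m.group(1)} := {values[m.group(1)]}", makefile_text)


# Hypothetical Makefile excerpt and hypothetical target paths.
example = (
    "RK3588_ATF := old/bl31.elf\n"
    "RK3588_TPL := old/ddr.bin\n"
    "TARGETS := target_rock-5b\n"
)
updated = set_make_vars(example, {
    "RK3588_ATF": "rkbin/rk3588_bl31_v1.45.elf",
    "RK3588_TPL": "rkbin/rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.16.bin",
})
print(updated)
```

Lines that do not assign one of the named variables, such as `TARGETS`, are left untouched.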

I had to install these:

apt -y install screen bc libssl-dev python3-pyelftools python3-setuptools swig

git config --global user.email "you@example.com"
git config --global user.name "Your Name"

mv rock-5b_idbloader.img idbloader.img
mv rock-5b_u-boot.itb u-boot.itb

Good luck!

Small update: it's also working for Joshua Riek's Ubuntu. It got rid of the SIGSEGV error in Chrome when watching video (now working for a while without errors). Best.
Edit: Today I got the error again when I opened a new tab, and had to restart to get it working again. I think it's still more stable than before.

After a period of stability, unfortunately the issue happened again, so I can rule out the power brick. I suspect it's a hardware issue.

Hi, @vadim

Can you try the latest B39 ROCK 5B build image? From the log, it seems to be a CPU or memory issue.

https://github.com/radxa-build/rock-5b/releases/download/b39/rock-5b_debian_bullseye_kde_b39.img.xz

Hi Jack, I'm currently using Ubuntu Focal with the latest updates from the Radxa repositories (kernel version 5.10.110-37-rockchip-g74457be0716d). Is the Debian image superior in terms of its stability? I could try to reinstall, but it's not going to be a minor effort, hence I'm trying to understand what we're trying to achieve.

Update: Looking at the releases - I could do a release-upgrade to Jammy. Would that be acceptable?

Is the Debian image superior in terms of its stability?

Yes. The Debian image release needs to pass our QA test.

I could try to reinstall but it’s not going to be a minor effort

You can use another SD Card to install an image and test the stability.

Update: Looking at the releases - I could do a release-upgrade to Jammy. Would that be acceptable?

This is not suggested. It will break some Rockchip binary packages.

Okay, in this case I'll move to Debian completely and give it a try. I want to keep the same workload as on Ubuntu, to remove any ambiguity from the testing process.

Hi @vadim,
maybe you could take a look at "Freezing Rock 5B with Rsync transfer",
the issue I'm facing.

Hi @jack, after months of trial and error trying to get the problem resolved, killing an NVMe drive in the process, I finally managed to get the latest image running.

While I would need more time to confirm whether the issue was resolved, I noticed that there are other kinds of problems with that image. For instance, updating the kernel to the latest (6.1.43-19-rk2312) from the current (6.1.43-15-rk2312) breaks Ethernet connectivity. You mentioned that the Debian OS goes through QA - is this a known issue? I reported it here just in case.

Last follow-up on the subject to close the thread. I haven't had a single restart in the last three weeks, and I consider the issue resolved. Remediation: replaced the NVMe drive (despite wear being well below the stated lifetime). Changing between multiple OS and kernel versions did not affect the stability.

Was it a Samsung 970 Evo by any chance? They seem to be too power-hungry for the Rock 5B.

Nope, it was a 1TB Kioxia EXCERIA. I replaced it with a Crucial P3 Plus and the problem went away. Before that, I tried removing all devices from the board, including an unpowered USB device and the M.2 TPU, and replaced the power supply with a 65W one with Power Delivery (suspecting power issues) - nothing helped. Interestingly enough, it worked just fine for a year or two before the issues started. Maybe it's somehow related to component wear.
