Ubuntu Server constantly crashing on Radxa Zero

Hi there, I’m having the same trouble with a new Radxa Zero I bought and then set up yesterday.

The first Radxa Zero I bought is not showing this problem. I’m trying to figure out what’s going on.

Some notes, but could be complete red herrings at this stage:

  • hardware differences: I notice my old Radxa Zero (v1.51) has an AW-CM256SM metal chip (next to the USB-C power connector), whereas the new one (also v1.51) has a chip marked AP6256.
  • software differences: I’m using the same Ubuntu image for both (I didn’t redownload anything), but the new one has been fully updated to the state of yesterday’s apt repositories whereas the old one hasn’t. I’ll report back with kernel versions etc. later on if this seems related.

I’m trying the new Radxa Zero on a different power supply to see if that may be related — although the old Radxa Zero seems perfectly happy on this power supply even though it’s running some software and has a USB microphone+speaker plugged in to the user port.

As for the fault:
I was wondering if it was just Wifi shutdown, but I don’t think it is. Not only do SSH connections die, but so do adb shell connections (and adb lists the device as offline once it’s dead: you can’t reconnect).
I have not yet tried to get a serial connection to see if there’s a kernel panic listed or whether it’s still controllable through there (but if both SSH and ADB are broken I would expect not to be able to).

On the serial connection, when the board died I got this line (and this line only):

[ 3558.042771] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __cpufreq_driver_target+0x650/0x6b0  

On the other hand, I didn’t only buy one board this time; I have another (with the same chip as the ‘new’ board that is dying) and this one is happy to stay alive for hours.
I have decided to reflash the dying board, making sure to sync properly after dd and to give it time to flush caches before rebooting it. I will report back on whether this solves my problem.
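
For anyone who wants to script the "write the image, then make sure it really reached the card" step, here is a minimal sketch in Python. The function name write_image and the chunk size are my own choices for illustration and are not part of any Radxa tooling. It copies the image in chunks and then calls os.fsync so the kernel is asked to flush its caches before the function returns.

```python
import os


def write_image(image_path, target_path, chunk_size=4 * 1024 * 1024):
    """Copy image_path to target_path in chunks, then flush it to storage.

    write_image is a hypothetical helper name; target_path would be the
    block device of the SD card or eMMC (an assumption, pick it carefully).
    """
    written = 0
    with open(image_path, "rb") as src, open(target_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
        dst.flush()             # push Python's own buffer to the kernel
        os.fsync(dst.fileno())  # ask the kernel to flush to the device
    return written
```

The return value is the number of bytes written, which can be compared with the size of the image file as a basic sanity check. Writing to a real block device normally needs root, and choosing the wrong target will destroy data.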

It crashed again after being reimaged, but I’ve updated to the latest apt repository (same as the working one from the new batch) and will leave it to sit on the new kernel version. Fingers crossed it doesn’t crash again?

Edit: it crashed again. Maybe it’s a defective unit?

Edit2: since I’m collecting crash logs … it crashed again, this time with this in the serial logs:

[ 2290.770021] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[ 2290.773277] Mem abort info:
[ 2290.775942]   ESR = 0x96000044
[ 2290.778962]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2290.784254]   SET = 0, FnV = 0
[ 2290.787240]   EA = 0, S1PTW = 0
[ 2290.790346] Data abort info:
[ 2290.793188]   ISV = 0, ISS = 0x00000044
[ 2290.796993]   CM = 0, WnR = 1
[ 2290.799919] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000001d73000
[ 2290.806341] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
[ 2290.813073] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[ 2290.818577] Modules linked in: rfcomm algif_hash algif_skcipher af_alg bnep snd_soc_hdmi_codec usb_f_fs libcomposite dw_hdmi_i2s_audio hci_uart btqca btrtl btbcm btintel bluetooth ecdh_generic ecc brcmfmac fusb302 tcpm typec brcmutil meson_dw_hdmi meson_saradc meson_drm snd_soc_meson_g12a_tohdmitx dw_hdmi snd_soc_meson_codec_glue cfg80211 drm_kms_helper rfkill reset_meson_audio_arb snd_soc_meson_axg_tdmout snd_soc_meson_axg_frddr snd_soc_meson_axg_fifo snd_soc_meson_axg_sound_card snd_soc_meson_axg_tdm_interface cec snd_soc_meson_card_utils snd_soc_meson_axg_tdm_formatter meson_rng meson_vdec(C) snd_soc_core v4l2_mem2mem ac97_bus videobuf2_dma_contig panfrost videobuf2_memops snd_pcm_dmaengine videobuf2_v4l2 gpu_sched videobuf2_common snd_pcm videodev snd_timer mc snd meson_canvas soundcore display_connector sch_fq_codel ramoops reed_solomon efi_pstore drm drm_panel_orientation_quirks ip_tables x_tables autofs4 axg_audio sclk_div clk_phase rtc_meson_vrtc
[ 2290.903460] CPU: 3 PID: 1211 Comm: kworker/u8:4 Tainted: G         C        5.10.69-13-amlogic-g104342c59952 #amlogic
[ 2290.913081] Hardware name: Radxa Zero (DT)
[ 2290.917106] Workqueue: brcmf_wq/mmc2:0001:1 brcmf_sdio_dataworker [brcmfmac]
[ 2290.924082] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[ 2290.930025] pc : __bitmap_and+0x20/0x80
[ 2290.933791] lr : select_task_rq_fair+0x8b8/0xb80
[ 2290.938365] sp : ffff800013ddb430
[ 2290.941627] x29: ffff800013ddb430 x28: ffff00000012f490
[ 2290.946913] x27: 0000000000000001 x26: ffff00000012f600
[ 2290.952174] x25: 000002155c885506 x24: 0000000000000003
[ 2290.957435] x23: ffff00006f9e17b0 x22: 0000000000000008
[ 2290.962696] x21: 0000000000000000 x20: 0000000000000001
[ 2290.967958] x19: 0000000000000003 x18: 0000000000000000
[ 2290.973219] x17: 0000000000000000 x16: 0000000000000015
[ 2290.978480] x15: 598b0ef0e14843c0 x14: 00000000000001ca
[ 2290.983741] x13: 00000000000001ca x12: 0000000000000000
[ 2290.989003] x11: 0000000000000000 x10: 0000000000000001
[ 2290.994264] x9 : ffff80005e5a2000 x8 : 0000000000000004
[ 2290.999525] x7 : 0000000000000000 x6 : 000000000000000f
[ 2291.004786] x5 : 0000000000000000 x4 : 0000000000000001
[ 2291.010047] x3 : 0000000000000100 x2 : ffff0000041ebd08
[ 2291.015309] x1 : ffff00000012f490 x0 : ffff00006f9e17b0
[ 2291.020572] Call trace:
[ 2291.022956]  __bitmap_and+0x20/0x80
[ 2291.026417]  try_to_wake_up+0x15c/0x418
[ 2291.032892]  default_wake_function+0x1c/0x40
[ 2291.037110]  pollwake+0x74/0xa0
[ 2291.040202]  __wake_up_common+0x90/0x158
[ 2291.044016]  __wake_up_common_lock+0x80/0xd0
[ 2291.048151]  __wake_up_sync_key+0x20/0x30
[ 2291.052120]  sock_def_readable+0x40/0x78
[ 2291.055991]  tcp_data_ready+0x2c/0xd0
[ 2291.059591]  tcp_rcv_established+0x5ec/0x6f8
[ 2291.063834]  tcp_v4_do_rcv+0x90/0x268
[ 2291.067455]  tcp_v4_rcv+0xacc/0xb90
[ 2291.070894]  ip_protocol_deliver_rcu+0x40/0x208
[ 2291.075403]  ip_local_deliver_finish+0x64/0x80
[ 2291.079770]  ip_local_deliver+0x80/0x128
[ 2291.083609]  ip_rcv_finish+0x90/0xb0
[ 2291.087103]  ip_rcv+0x60/0x118
[ 2291.090025]  __netif_receive_skb_one_core+0x5c/0x88
[ 2291.094795]  __netif_receive_skb+0x18/0x68
[ 2291.098738]  process_backlog+0xa4/0x170
[ 2291.102436]  net_rx_action+0x168/0x3f8
[ 2291.106051]  efi_header_end+0x11c/0x268
[ 2291.109754]  do_softirq.part.15+0x68/0x78
[ 2291.113630]  do_softirq+0x20/0x30
[ 2291.116796]  netif_rx_ni+0x74/0x78
[ 2291.120102]  brcmf_netif_rx+0xc4/0x110 [brcmfmac]
[ 2291.124708]  brcmf_rx_frame+0x108/0x190 [brcmfmac]
[ 2291.129418]  brcmf_sdio_dataworker+0x130c/0x2498 [brcmfmac]
[ 2291.134907]  process_one_work+0x1e8/0x360
[ 2291.138849]  worker_thread+0x44/0x478
[ 2291.142463]  kthread+0x150/0x158
[ 2291.145637]  ret_from_fork+0x10/0x34
[ 2291.149175] Code: d2800006 d503201f f8647825 f8647847 (8a0700a5)
[ 2291.155279] ---[ end trace f8d7d628de2de0f6 ]---
[ 2291.159853] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 2291.166732] SMP: stopping secondary CPUs
[ 2292.212290] SMP: failed to stop secondary CPUs 1,3
[ 2292.213938] Kernel Offset: disabled
[ 2292.216534] CPU features: 0x0240002,20002004
[ 2292.220775] Memory Limit: none
[ 2292.223794] Rebooting in 10 seconds..
[ 2302.227856] SMP: stopping secondary CPUs
[ 2303.271233] SMP: failed to stop secondary CPUs 1,3

It rebooted itself this time, which is not what normally happens, so this might be yet another issue :confused:
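
Since several different panics show up in this thread, it can help to pull just the panic lines out of a saved serial capture so they can be compared across boards. Below is a minimal sketch in Python. The function name find_panics is hypothetical, and it assumes the capture uses the same "[ seconds.micros] Kernel panic - not syncing: ..." layout as the logs pasted above.

```python
import re

# Matches lines such as:
# [ 2291.159853] Kernel panic - not syncing: Oops: Fatal exception in interrupt
PANIC_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+Kernel panic - not syncing: (.*)$")


def find_panics(log_text):
    """Return a list of (uptime_seconds, message) for each kernel panic line."""
    panics = []
    for line in log_text.splitlines():
        match = PANIC_RE.match(line.strip())
        if match:
            panics.append((float(match.group(1)), match.group(2).strip()))
    return panics
```

Running it over the text of a capture file (for example the output saved by a serial terminal program) gives the uptime at which each panic happened and the reason string. That is enough to see whether a board always dies in the same place or in different ones.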

Edit3: a second one of my boards out of this batch is doing this as well. I would be open to hearing suggestions but I think they may be defective.

I have a Radxa Zero and cannot run Ubuntu Server focal 20220801 either. It boots to the login screen and then automatically reboots. I don’t know why, so I have to use Armbian focal instead.

Which Armbian version did you use? The latest ones marked as stable on the Armbian webpage did not boot for me.

The latest stable Armbian focal (Ubuntu) can be found here

It’s interesting to hear that the Armbian image works. However in my case the Armbian image would intermittently not boot (see Radxa Zero intermittently won't boot; is my eMMC faulty?) so I’m a bit reluctant to keep distro hopping until I find an image that doesn’t show either problem. :confused:

The crashes became unbearable. I had to unplug my Radxa Zero 2-3 times per hour in the last few days.
After that, I installed the Manjaro minimal image (https://github.com/manjaro-arm/radxa-zero-images/releases/download/22.02/Manjaro-ARM-minimal-radxa-zero-22.02.img.xz) and it’s been working without issues for a few days. I hope it continues working as expected.

It might be worth putting this in /etc/modprobe.d/brcmfmac.conf and rebooting:
options brcmfmac fcmode=0 roamoff=1
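
To double-check that the file really contains what was intended before rebooting, a small parser is enough. This is a minimal sketch in Python. The function name parse_modprobe_options is hypothetical, and it only reads text in the "options <module> key=value ..." form shown above. It does not tell you whether the driver actually picked the values up.

```python
def parse_modprobe_options(text, module):
    """Collect key=value options set for `module` in modprobe.d-style text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Only "options <module> ..." lines for the requested module count.
        if len(parts) >= 3 and parts[0] == "options" and parts[1] == module:
            for item in parts[2:]:
                key, _, value = item.partition("=")
                options[key] = value
    return options
```

Feeding it the contents of /etc/modprobe.d/brcmfmac.conf with module set to "brcmfmac" should give back fcmode and roamoff with the values from the line above. Values are kept as strings, since this sketch makes no assumption about their types.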

I installed the official Armbian bullseye CLI image on my 1GB Zero, applied the above and then installed OS updates and compiled a Qt application without any problems - would have expected it to have crashed at least once by now.

Crashed on me again yesterday (or at least fell off the network) while merely booted and not doing anything in particular

Afraid I’ve given up on sorting this out and am returning the boards to ALLNET.
The one board I had that I thought wasn’t suffering from this problem had a couple of crashes (just much less frequently). It’s unfortunate but instability is a big big problem for the use cases I bought these boards for.

Unfortunately I have to agree that the 1GB model at least isn’t stable enough to use for anything really

Hello @brath @skm @reivilibre, I have this issue too, and it has to do with timings at the kernel level. I cannot give any specifics, but I have fixed it with the help of patches from @steev (thanks!). I have two boards, but one of them crashes constantly. I will upload the kernel I compiled if you are interested in testing it. If you want to recompile your kernel, here are the patches I applied:

hack: make the loop match vendor loop · steev/linux@caba200 · GitHub

Hi reivilibre. I have received the returns from Allnet. Can you share how you reproduced this issue? Did you just let it sit idle?

I ran s-tui with stress-ng for over 4 hours and did not observe a kernel panic.

Update: it has been up for 17 hours now.

Hi, I bought 6 Radxa Zero boards with different configurations (RAM/eMMC). 100% of them have issues: they disconnect from wifi and ADB stops working. I set up some “force” procedures to make the boards reconnect to wifi after a failure, which worked for a while, but in the end they crash. It looks more like an image problem. Testing the boards for a few hours does not show that they are stable, because they crash eventually over time. I used the official image “radxa-zero-ubuntu-focal-server-arm64-20220801-0346-mbr.img” from https://github.com/radxa-build/radxa-zero/releases/tag/20220801-0213. At first glance it works, but after a couple of days, maybe weeks, it crashes.

This is not a normal problem for a production batch. Your boards have excellent specifications but are currently very unstable. Is there an official image that works with whichever wifi chip you used and that is stable?

We are planning to use an SBC for our company but, for now, we can’t rely on your products. Any help or support would be greatly appreciated.

Hi @shad0vv! Did you have any success? Were you able to compile the kernel and make it work? Any update would be greatly appreciated!

Zero got pushed back on the release schedule so no images were out for quite a long time. Please give this test image a try:

https://github.com/radxa-build/radxa-zero/releases/download/test-build-4/radxa-zero_debian_bookworm-test_kde_t4.img.xz

Ok! Will try but the “https://github.com/radxa-build/radxa-zero/releases/download/test-build-4/radxa-zero_debian_bookworm-test_cli_t4.img.xz

Currently we only officially support the desktop image.