Hi All,
I have a RockPi 4B (4G of RAM) with the Penta SATA kit. I got everything set up with no problems and was excited for this neat little NAS. Soon, however, it began locking up and freezing at seemingly random times. It was hard to catch these freezes because I’m running it headless, but a few times I was ssh’d in and saw things like “Internal error: Oops: 96000004 [#1] PREEMPT SMP” dumped right before the lockup.
To make a long story short, I ended up suspecting faulty hardware (most likely the memory). This suspicion came from the following tests:
- Identical lockups occurred with 4 different operating systems: Armbian Buster, Armbian Focal, and the official Radxa images for Debian and Ubuntu.
- The lockups occurred with many different physical eMMC modules
- Stress testing the CPU did not seem to increase the frequency of lockups
- The Linux utility `memtester` showed many faults across a wide range of memory sizes, from 2G down to 16M.
- The `memtester` faults occurred regardless of whether I was testing on the bare board or with it attached to the SATA HAT.
- I have a much older RockPi 4B that has been in use for another project (no SATA HAT) nonstop for months. I ran `memtester` on it as a comparison and found no faults after hours of running.
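For anyone who wants to repeat the memory comparison, here is a minimal sketch of a size sweep like the one described above. It assumes `memtester` is installed and invoked as `memtester <mem> [loops]`. The function names and the exact list of sizes are made up for illustration, not what was actually run on these boards.

```python
import subprocess

# Sizes from 2G down to 16M, halving each step (illustrative choice).
SIZES = ["2G", "1G", "512M", "256M", "128M", "64M", "32M", "16M"]


def build_memtester_commands(sizes=SIZES, loops=1):
    """Return one argv list per size, in the form: memtester <mem> <loops>."""
    return [["memtester", size, str(loops)] for size in sizes]


def run_sweep(sizes=SIZES, loops=1):
    """Run each test in turn and return {size: exit_code}.

    memtester exits with 0 when it found no problems; a non-zero code
    means a test failed or the memory could not be allocated/locked.
    """
    results = {}
    for cmd in build_memtester_commands(sizes, loops):
        # Usually needs root so memtester can lock the memory it tests.
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        results[cmd[1]] = proc.returncode
    return results
```

Calling `run_sweep()` as root on each board and comparing the two result dictionaries gives a quick pass/fail picture per size. The full `memtester` output is still needed to see which individual tests failed.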
Really, it was the huge number of `memtester` faults that led to my next test: I bought a copy of basically every part of the Penta SATA kit except for the case. Instead of a RockPi 4B I got a 4A this time, but still with 4G of memory. Because I was paranoid, the first thing I did was install a new OS (Armbian Focal with kernel 5.10) onto a newly purchased eMMC and run `memtester`. This was before I had attached any part of the SATA HAT. It ran for a very long time with almost no faults (so better than the first board!), but it still reported one fault on a 1G memory test. In any case, I forged ahead.
The whole setup has been running much more smoothly than the first board, but I do get a very similar lockup every few days or so (even if the computer is just sitting there doing nothing). For whatever reason, when the system locks up now, I first notice it because the display on top of the HAT freezes. At this point I can still ssh in and grab some logs with `journalctl`, but if I attempt to do basically anything else (including a restart) the whole system locks up. This gives way more information than the first board did, though, and may hold some clue. Here’s an example:
Jul 05 06:27:30 HOST kernel: Unable to handle kernel paging request at virtual address 000000002f00d980
Jul 05 06:27:30 HOST kernel: Mem abort info:
Jul 05 06:27:30 HOST kernel: ESR = 0x96000004
Jul 05 06:27:30 HOST kernel: EC = 0x25: DABT (current EL), IL = 32 bits
Jul 05 06:27:30 HOST kernel: SET = 0, FnV = 0
Jul 05 06:27:30 HOST kernel: EA = 0, S1PTW = 0
Jul 05 06:27:30 HOST kernel: Data abort info:
Jul 05 06:27:30 HOST kernel: ISV = 0, ISS = 0x00000004
Jul 05 06:27:30 HOST kernel: CM = 0, WnR = 0
Jul 05 06:27:30 HOST kernel: user pgtable: 4k pages, 48-bit VAs, pgdp=000000002b65f000
Jul 05 06:27:30 HOST kernel: [000000002f00d980] pgd=0000000000000000, p4d=0000000000000000
Jul 05 06:27:30 HOST kernel: Internal error: Oops: 96000004 [#1] PREEMPT SMP
Jul 05 06:27:30 HOST kernel: Modules linked in: zram xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge aufs rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) snd_soc_hdmi_codec snd_soc_audio_graph_card panfrost snd_soc_simple_card snd_soc_simple_card_utils gpu_sched dw_hdmi_cec snd_soc_rockchip_i2s dw_hdmi_i2s_audio hantro_vpu(C) rockchip_vdec(C) v4l2_h264 snd_soc_es8316 rockchip_rga videobuf2_dma_contig v4l2_mem2mem videobuf2_dma_sg snd_soc_core videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common snd_pcm snd_timer videodev sg snd mc soundcore cpufreq_dt sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek
Jul 05 06:27:30 HOST kernel: rockchipdrm analogix_dp dw_hdmi dw_mipi_dsi dwmac_rk stmmac_platform drm_kms_helper stmmac pcs_xpcs cec rc_core drm drm_panel_orientation_quirks
Jul 05 06:27:30 HOST kernel: CPU: 3 PID: 3809147 Comm: python3 Tainted: P C OE 5.10.43-rockchip64 #21.05.4
Jul 05 06:27:30 HOST kernel: Hardware name: Radxa ROCK Pi 4A (DT)
Jul 05 06:27:30 HOST kernel: pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
Jul 05 06:27:30 HOST kernel: pc : lock_page_memcg+0x34/0xc0
Jul 05 06:27:30 HOST kernel: lr : lock_page_memcg+0x28/0xc0
Jul 05 06:27:30 HOST kernel: sp : ffff800029e238c0
Jul 05 06:27:30 HOST kernel: x29: ffff800029e238c0 x28: 00e000003b16dbc3
Jul 05 06:27:30 HOST kernel: x27: 0000000000000000 x26: ffff800029e23b08
Jul 05 06:27:30 HOST kernel: x25: 0000ffff927ae000 x24: ffff0000344a43e8
Jul 05 06:27:30 HOST kernel: x23: fffffe0000cc5b40 x22: 0000ffff927ad000
Jul 05 06:27:30 HOST kernel: x21: ffff800029e239f8 x20: fffffe0000cc5b40
Jul 05 06:27:30 HOST kernel: x19: ff0000002f00d000 x18: 0000000000000000
Jul 05 06:27:30 HOST kernel: x17: 0000000000000000 x16: 0000000000000000
Jul 05 06:27:30 HOST kernel: x15: 0000000000000001 x14: 0000000000000002
Jul 05 06:27:30 HOST kernel: x13: 000000000004111a x12: 0000000000000018
Jul 05 06:27:30 HOST kernel: x11: 0101010101010101 x10: ffff8000e6211000
Jul 05 06:27:30 HOST kernel: x9 : ffff0000f77966e0 x8 : 00000000000001ff
Jul 05 06:27:30 HOST kernel: x7 : ffff800029e238c0 x6 : ffff0000f77966f0
Jul 05 06:27:30 HOST kernel: x5 : 0000ffff927ad000 x4 : ffff0000344a43e8
Jul 05 06:27:30 HOST kernel: x3 : 00000000fffffecb x2 : 0000000000000001
Jul 05 06:27:30 HOST kernel: x1 : ffff0000010b4880 x0 : 0000000000000001
Jul 05 06:27:30 HOST kernel: Call trace:
Jul 05 06:27:30 HOST kernel: lock_page_memcg+0x34/0xc0
Jul 05 06:27:30 HOST kernel: page_remove_rmap+0x1c/0x568
Jul 05 06:27:30 HOST kernel: unmap_page_range+0x56c/0x848
Jul 05 06:27:30 HOST kernel: unmap_single_vma+0x88/0x100
Jul 05 06:27:30 HOST kernel: unmap_vmas+0xdc/0x100
Jul 05 06:27:30 HOST kernel: exit_mmap+0xd4/0x188
Jul 05 06:27:30 HOST kernel: mmput+0x7c/0x160
Jul 05 06:27:30 HOST kernel: begin_new_exec+0x2d4/0xa60
Jul 05 06:27:30 HOST kernel: load_elf_binary+0x73c/0x1800
Jul 05 06:27:30 HOST kernel: bprm_execve+0x28c/0x638
Jul 05 06:27:30 HOST kernel: do_execveat_common.isra.48+0x1a8/0x1c8
Jul 05 06:27:30 HOST kernel: __arm64_sys_execve+0x40/0x58
Jul 05 06:27:30 HOST kernel: el0_svc_common.constprop.2+0x8c/0x190
Jul 05 06:27:30 HOST kernel: do_el0_svc+0x24/0x90
Jul 05 06:27:30 HOST kernel: el0_svc+0x14/0x20
Jul 05 06:27:30 HOST kernel: el0_sync_handler+0x90/0xb8
Jul 05 06:27:30 HOST kernel: el0_sync+0x160/0x180
Jul 05 06:27:30 HOST kernel: Code: 97f92203 d503201f f9401e93 b40002d3 (b9498260)
Jul 05 06:27:30 HOST kernel: ---[ end trace 3474353eefa9fd6b ]---
Jul 05 06:27:30 HOST kernel: note: python3[3809147] exited with preempt_count 1
Jul 05 06:27:30 HOST python3[5056]: Process Process-3:
Jul 05 06:27:30 HOST python3[5056]: Traceback (most recent call last):
Jul 05 06:27:30 HOST python3[5056]: File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
Jul 05 06:27:30 HOST python3[5056]: self.run()
Jul 05 06:27:30 HOST python3[5056]: File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
Jul 05 06:27:30 HOST python3[5056]: self._target(*self._args, **self._kwargs)
Jul 05 06:27:30 HOST python3[5056]: File "/usr/bin/rockpi-penta/oled.py", line 108, in auto_slider
Jul 05 06:27:30 HOST python3[5056]: slider(lock)
Jul 05 06:27:30 HOST python3[5056]: File "/usr/bin/rockpi-penta/oled.py", line 101, in slider
Jul 05 06:27:30 HOST python3[5056]: for item in misc.slider_next(gen_pages()):
Jul 05 06:27:30 HOST python3[5056]: File "/usr/bin/rockpi-penta/oled.py", line 87, in gen_pages
Jul 05 06:27:30 HOST python3[5056]: {'xy': (0, 21), 'text': misc.get_info('ip'), 'fill': 255, 'font': font['11']},
Jul 05 06:27:30 HOST python3[5056]: File "/usr/bin/rockpi-penta/misc.py", line 48, in get_info
Jul 05 06:27:30 HOST python3[5056]: return check_output(cmds[s])
Jul 05 06:27:30 HOST python3[5056]: File "/usr/bin/rockpi-penta/misc.py", line 36, in check_output
Jul 05 06:27:30 HOST python3[5056]: return subprocess.check_output(cmd, shell=True).decode().strip()
Jul 05 06:27:30 HOST python3[5056]: File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
Jul 05 06:27:30 HOST python3[5056]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Jul 05 06:27:30 HOST python3[5056]: File "/usr/lib/python3.8/subprocess.py", line 512, in run
Jul 05 06:27:30 HOST python3[5056]: raise CalledProcessError(retcode, process.args,
Jul 05 06:27:30 HOST python3[5056]: subprocess.CalledProcessError: Command 'hostname -I | awk '{printf "IP %s", $1}'' died with <Signals.SIGSEGV: 11>.
Jul 05 06:27:30 HOST kernel: Unable to handle kernel paging request at virtual address 000000002f00d980
Jul 05 06:27:30 HOST kernel: Mem abort info:
Jul 05 06:27:30 HOST kernel: ESR = 0x96000006
Jul 05 06:27:30 HOST kernel: EC = 0x25: DABT (current EL), IL = 32 bits
Jul 05 06:27:30 HOST kernel: SET = 0, FnV = 0
Jul 05 06:27:30 HOST kernel: EA = 0, S1PTW = 0
Jul 05 06:27:30 HOST kernel: Data abort info:
Jul 05 06:27:30 HOST kernel: ISV = 0, ISS = 0x00000006
Jul 05 06:27:30 HOST kernel: CM = 0, WnR = 0
Jul 05 06:27:30 HOST kernel: user pgtable: 4k pages, 48-bit VAs, pgdp=0000000010aa1000
Jul 05 06:27:30 HOST kernel: [000000002f00d980] pgd=0000000010aa2003, p4d=0000000010aa2003, pud=0000000010aa3003, pmd=0000000000000000
Jul 05 06:27:30 HOST kernel: Internal error: Oops: 96000006 [#2] PREEMPT SMP
Jul 05 06:27:30 HOST kernel: Modules linked in: zram xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge aufs rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) snd_soc_hdmi_codec snd_soc_audio_graph_card panfrost snd_soc_simple_card snd_soc_simple_card_utils gpu_sched dw_hdmi_cec snd_soc_rockchip_i2s dw_hdmi_i2s_audio hantro_vpu(C) rockchip_vdec(C) v4l2_h264 snd_soc_es8316 rockchip_rga videobuf2_dma_contig v4l2_mem2mem videobuf2_dma_sg snd_soc_core videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common snd_pcm snd_timer videodev sg snd mc soundcore cpufreq_dt sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek
Jul 05 06:27:30 HOST kernel: rockchipdrm analogix_dp dw_hdmi dw_mipi_dsi dwmac_rk stmmac_platform drm_kms_helper stmmac pcs_xpcs cec rc_core drm drm_panel_orientation_quirks
Jul 05 06:27:30 HOST kernel: CPU: 1 PID: 5056 Comm: python3 Tainted: P D C OE 5.10.43-rockchip64 #21.05.4
Jul 05 06:27:30 HOST kernel: Hardware name: Radxa ROCK Pi 4A (DT)
Jul 05 06:27:30 HOST kernel: pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
Jul 05 06:27:30 HOST kernel: pc : lock_page_memcg+0x34/0xc0
Jul 05 06:27:30 HOST kernel: lr : lock_page_memcg+0x28/0xc0
Jul 05 06:27:30 HOST kernel: sp : ffff8000157b3a80
Jul 05 06:27:30 HOST kernel: x29: ffff8000157b3a80 x28: 00e000003b16dbc3
Jul 05 06:27:30 HOST kernel: x27: 0000000000000000 x26: ffff8000157b3cc8
Jul 05 06:27:30 HOST kernel: x25: 0000ffff927ae000 x24: ffff000021587bb8
Jul 05 06:27:30 HOST kernel: x23: fffffe0000cc5b40 x22: 0000ffff927ad000
Jul 05 06:27:30 HOST kernel: x21: ffff8000157b3bb8 x20: fffffe0000cc5b40
Jul 05 06:27:30 HOST kernel: x19: ff0000002f00d000 x18: 0000000000000000
Jul 05 06:27:30 HOST kernel: x17: 0000000000000000 x16: 0000000000000000
Jul 05 06:27:30 HOST kernel: x15: 0000000000000001 x14: 0000000000000002
Jul 05 06:27:30 HOST kernel: x13: 0000000000040fba x12: 0000000000000000
Jul 05 06:27:30 HOST kernel: x11: 0000000000000000 x10: ffff8000e61d1000
Jul 05 06:27:30 HOST kernel: x9 : ffff0000f77566e0 x8 : 0000000000000000
Jul 05 06:27:30 HOST kernel: x7 : ffff8000157b3a80 x6 : ffff0000f77566f0
Jul 05 06:27:30 HOST kernel: x5 : 0000ffff927ad000 x4 : ffff000021587bb8
Jul 05 06:27:30 HOST kernel: x3 : 00000000fffffecb x2 : 0000000000000001
Jul 05 06:27:30 HOST kernel: x1 : ffff0000053d4880 x0 : 0000000000000001
Jul 05 06:27:30 HOST kernel: Call trace:
Jul 05 06:27:30 HOST kernel: lock_page_memcg+0x34/0xc0
Jul 05 06:27:30 HOST kernel: page_remove_rmap+0x1c/0x568
Jul 05 06:27:30 HOST kernel: unmap_page_range+0x56c/0x848
Jul 05 06:27:30 HOST kernel: unmap_single_vma+0x88/0x100
Jul 05 06:27:30 HOST kernel: unmap_vmas+0xdc/0x100
Jul 05 06:27:30 HOST kernel: exit_mmap+0xd4/0x188
Jul 05 06:27:30 HOST kernel: mmput+0x7c/0x160
Jul 05 06:27:30 HOST kernel: do_exit+0x31c/0xab8
Jul 05 06:27:30 HOST kernel: do_group_exit+0x44/0xa0
Jul 05 06:27:30 HOST kernel: __wake_up_parent+0x0/0x30
Jul 05 06:27:30 HOST kernel: el0_svc_common.constprop.2+0x8c/0x190
Jul 05 06:27:30 HOST kernel: do_el0_svc+0x24/0x90
Jul 05 06:27:30 HOST kernel: el0_svc+0x14/0x20
Jul 05 06:27:30 HOST kernel: el0_sync_handler+0x90/0xb8
Jul 05 06:27:30 HOST kernel: el0_sync+0x160/0x180
Jul 05 06:27:30 HOST kernel: Code: 97f92203 d503201f f9401e93 b40002d3 (b9498260)
Jul 05 06:27:30 HOST kernel: ---[ end trace 3474353eefa9fd6c ]---
Jul 05 06:27:30 HOST kernel: note: python3[5056] exited with preempt_count 1
Jul 05 06:27:30 HOST kernel: Fixing recursive fault but reboot is needed!
Jul 05 06:27:30 HOST kernel: ------------[ cut here ]------------
Jul 05 06:27:30 HOST kernel: WARNING: CPU: 1 PID: 5056 at kernel/rcu/tree_plugin.h:297 rcu_note_context_switch+0x5c/0x400
Jul 05 06:27:30 HOST kernel: Modules linked in: zram xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge aufs rfkill governor_performance zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) snd_soc_hdmi_codec snd_soc_audio_graph_card panfrost snd_soc_simple_card snd_soc_simple_card_utils gpu_sched dw_hdmi_cec snd_soc_rockchip_i2s dw_hdmi_i2s_audio hantro_vpu(C) rockchip_vdec(C) v4l2_h264 snd_soc_es8316 rockchip_rga videobuf2_dma_contig v4l2_mem2mem videobuf2_dma_sg snd_soc_core videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common snd_pcm snd_timer videodev sg snd mc soundcore cpufreq_dt sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod realtek
Jul 05 06:27:30 HOST kernel: rockchipdrm analogix_dp dw_hdmi dw_mipi_dsi dwmac_rk stmmac_platform drm_kms_helper stmmac pcs_xpcs cec rc_core drm drm_panel_orientation_quirks
Jul 05 06:27:30 HOST kernel: CPU: 1 PID: 5056 Comm: python3 Tainted: P D C OE 5.10.43-rockchip64 #21.05.4
Jul 05 06:27:30 HOST kernel: Hardware name: Radxa ROCK Pi 4A (DT)
Jul 05 06:27:30 HOST kernel: pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
Jul 05 06:27:30 HOST kernel: pc : rcu_note_context_switch+0x5c/0x400
Jul 05 06:27:30 HOST kernel: lr : rcu_note_context_switch+0x4c/0x400
Jul 05 06:27:30 HOST kernel: sp : ffff8000157b35e0
Jul 05 06:27:30 HOST kernel: x29: ffff8000157b35e0 x28: ffff0000053d4880
Jul 05 06:27:30 HOST kernel: x27: 0000000000000000 x26: ffff800011b41000
Jul 05 06:27:30 HOST kernel: x25: ffff8000102dca7c x24: 0000000000000000
Jul 05 06:27:30 HOST kernel: x23: 0000000000000000 x22: ffff0000053d4880
Jul 05 06:27:30 HOST kernel: x21: ffff800011b27858 x20: ffff0000053d4880
Jul 05 06:27:30 HOST kernel: x19: ffff0000f7751980 x18: 0000000000000010
Jul 05 06:27:30 HOST kernel: x17: 0000000000000000 x16: 0000000000000000
Jul 05 06:27:30 HOST kernel: x15: 0000000000000329 x14: ffff8000157b33c0
Jul 05 06:27:30 HOST kernel: x13: 00000000ffffffea x12: ffff80001194edc8
Jul 05 06:27:30 HOST kernel: x11: 0000000000000003 x10: ffff800011936d88
Jul 05 06:27:30 HOST kernel: x9 : ffff800011936de0 x8 : 0000000000017fe8
Jul 05 06:27:30 HOST kernel: x7 : c0000000ffffefff x6 : 0000000000000001
Jul 05 06:27:30 HOST kernel: x5 : 0000000000000001 x4 : ffff8000e61d1000
Jul 05 06:27:30 HOST kernel: x3 : 0000000000000001 x2 : ffff80001156a000
Jul 05 06:27:30 HOST kernel: x1 : ffff8000e61d1000 x0 : 0000000000000001
Jul 05 06:27:30 HOST kernel: Call trace:
Jul 05 06:27:30 HOST kernel: rcu_note_context_switch+0x5c/0x400
Jul 05 06:27:30 HOST kernel: __schedule+0xac/0x758
Jul 05 06:27:30 HOST kernel: schedule+0x40/0xf8
Jul 05 06:27:30 HOST kernel: do_exit+0xf4/0xab8
Jul 05 06:27:30 HOST kernel: die+0x208/0x248
Jul 05 06:27:30 HOST kernel: die_kernel_fault+0x64/0x78
Jul 05 06:27:30 HOST kernel: __do_kernel_fault+0x74/0x148
Jul 05 06:27:30 HOST kernel: do_page_fault+0x1c8/0x3a8
Jul 05 06:27:30 HOST kernel: do_translation_fault+0x50/0x60
Jul 05 06:27:30 HOST kernel: do_mem_abort+0x40/0xa0
Jul 05 06:27:30 HOST kernel: el1_abort+0x48/0x70
Jul 05 06:27:30 HOST kernel: el1_sync_handler+0x64/0xe8
Jul 05 06:27:30 HOST kernel: el1_sync+0x84/0x140
Jul 05 06:27:30 HOST kernel: lock_page_memcg+0x34/0xc0
Jul 05 06:27:30 HOST kernel: page_remove_rmap+0x1c/0x568
Jul 05 06:27:30 HOST kernel: unmap_page_range+0x56c/0x848
Jul 05 06:27:30 HOST kernel: unmap_single_vma+0x88/0x100
Jul 05 06:27:30 HOST kernel: unmap_vmas+0xdc/0x100
Jul 05 06:27:30 HOST kernel: exit_mmap+0xd4/0x188
Jul 05 06:27:30 HOST kernel: mmput+0x7c/0x160
Jul 05 06:27:30 HOST kernel: do_exit+0x31c/0xab8
Jul 05 06:27:30 HOST kernel: do_group_exit+0x44/0xa0
Jul 05 06:27:30 HOST kernel: __wake_up_parent+0x0/0x30
Jul 05 06:27:30 HOST kernel: el0_svc_common.constprop.2+0x8c/0x190
Jul 05 06:27:30 HOST kernel: do_el0_svc+0x24/0x90
Jul 05 06:27:30 HOST kernel: el0_svc+0x14/0x20
Jul 05 06:27:30 HOST kernel: el0_sync_handler+0x90/0xb8
Jul 05 06:27:30 HOST kernel: el0_sync+0x160/0x180
Jul 05 06:27:30 HOST kernel: ---[ end trace 3474353eefa9fd6d ]---
After that, every three minutes I see the following logs, which didn’t show up prior to the above dump:
Jul 05 06:28:30 HOST kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jul 05 06:28:30 HOST kernel: rcu: Tasks blocked on level-0 rcu_node (CPUs 0-5):
Jul 05 06:28:30 HOST kernel: (detected by 5, t=15002 jiffies, g=8841773, q=66)
Jul 05 06:28:30 HOST kernel: rcu: All QSes seen, last rcu_preempt kthread activity 1 (4352681183-4352681182), jiffies_till_next_fqs=1, root ->qsmask 0x0
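To help read the two oops dumps above, here is a rough sketch that decodes the values they print. The field layout follows the standard arm64 ESR encoding. The helper names are invented for this example. The address check assumes the kernel sign-extends the faulting address from bit 55 before printing it.

```python
def decode_esr(esr):
    """Split an arm64 ESR value into the fields shown in the oops."""
    iss = esr & 0x1FFFFFF                 # instruction-specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,         # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,          # 1 = 32-bit instruction
        "iss": iss,
        "dfsc": iss & 0x3F,               # data fault status code, ISS[5:0]
        "wnr": (iss >> 6) & 0x1,          # 1 = write access, 0 = read access
    }


def translation_fault_level(dfsc):
    """Return the page-table level for a translation fault, else None."""
    if dfsc & 0x3C == 0x04:               # 0b0001LL = translation fault, level LL
        return dfsc & 0x3
    return None


def decode_ldr_uimm32(insn):
    """Decode a 32-bit 'LDR Wt, [Xn, #imm]' (unsigned offset) instruction."""
    if insn >> 22 != 0x2E5:               # fixed opcode bits for this form
        return None
    imm = ((insn >> 10) & 0xFFF) * 4      # imm12 is scaled by the access size
    return (insn >> 5) & 0x1F, insn & 0x1F, imm   # (Xn, Wt, byte offset)


def untag(addr):
    """Sign-extend a 64-bit address from bit 55 (assumed kernel behaviour)."""
    low = addr & ((1 << 56) - 1)
    return low | (0xFF << 56) if addr & (1 << 55) else low
```

With the values from the logs, `0x96000004` and `0x96000006` both decode to EC `0x25` (a data abort taken in the kernel) on a read. They are translation faults at levels 0 and 2, which fits the `pgd=0` and `pmd=0` lines. The faulting instruction word `b9498260` decodes to a load from `x19 + 0x980`, and `untag(0xff0000002f00d000 + 0x980)` gives `0x2f00d980`, the reported fault address. The other kernel pointers in the dump begin with `ffff0000`, while `x19` begins with `ff000000`. That would be consistent with a corrupted pointer, though this is only a guess.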
So I’m starting to think that I’ve gotten unlucky and managed to get 2 different boards with bad memory modules, but I would love for the problem to be something else; the novelty of getting new boards is going to wear off pretty quickly. Here are some more clues/info that might be useful:
- A surefire way to get my first board to crash is to do some heavy write activity on the 4-disk SATA array. For this I use `bonnie++` and can usually get a crash within an hour.
- I am using ZFS to manage the 4 HDDs, but I did some tests and confirmed that the problem still exists when using `mdadm` with a simple RAID5 as well.
- I have 2 1TB Seagate Barracuda and 2 1TB WD Blue HDDs
- I am using the 60w power supply that came with the SATA HAT kit
- The problem still exists with every combination of Armbian ramlogging and zram being on/off
- It seems like other members on these forums have faced similar issues that remain unresolved. For example, here and here. (There are more, but I’m only allowed to put two links in this post. Just search “Oops” on the forums.)
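Because the crashes are hard to catch live on a headless box, a small script can help compare oopses across runs. Below is a minimal sketch that pulls the oops codes and faulting addresses out of kernel log lines like the ones posted above. The function name and regular expressions are written for this example and only match the message formats shown in this post.

```python
import re

# Patterns taken from the kernel messages quoted in this post.
OOPS_RE = re.compile(r"Internal error: Oops: ([0-9a-f]+) \[#(\d+)\]")
FAULT_RE = re.compile(
    r"Unable to handle kernel paging request at virtual address ([0-9a-f]+)"
)


def summarize_oopses(lines):
    """Return ([(esr, oops_number), ...], [fault_address, ...]) from log lines."""
    oopses, addresses = [], []
    for line in lines:
        match = OOPS_RE.search(line)
        if match:
            oopses.append((int(match.group(1), 16), int(match.group(2))))
        match = FAULT_RE.search(line)
        if match:
            addresses.append(int(match.group(1), 16))
    return oopses, addresses
```

One way to use it is to save the output of `journalctl -k --no-pager` to a file after a crash and pass the file's lines to `summarize_oopses`. If the same faulting address shows up across different crashes, that would be worth noting.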
Any insight anyone has would be greatly appreciated. Thanks!