Two bug reports for the ROCK 3A (kernel and U-Boot)

Kernel:
commit id: 526c758e05e0

The following occurs on shutdown or reboot. The board only resets after watchdog0 times out.

[   22.460242] watchdog: watchdog0: watchdog did not stop!
[   83.748310] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   83.748875] rcu:     2-...0: (1 GPs behind) idle=f8a/1/0x4000000000000000 softirq=3966/3967 fqs=5936
[   83.749661] rcu:     (detected by 3, t=18002 jiffies, g=3853, q=289)
[   83.750224] Call trace:
[   83.750468]  __switch_to+0xc4/0x128
[   83.750782]  0xffffff800803bb48

U-Boot:

The board fails to boot when an NVMe drive with no OS on it is installed.

U-Boot SPL board init
U-Boot SPL 2017.09-gf48bdfd82a-211223 #aghost (Dec 31 2021 - 21:09:55)
unknown raw ID phN
unrecognized JEDEC id bytes: 00, 00, 00
Trying to boot from MMC2
## Verified-boot: 0
## Checking atf-1 0x00040000 ... sha256(fe4f274c06...) + OK
## Checking uboot 0x00a00000 ... sha256(6d803699cb...) + OK
## Checking fdt 0x00b2f250 ... sha256(ed0e2abb0d...) + OK
## Checking atf-2 0x00068000 ... sha256(8d44036095...) + OK
## Checking atf-3 0xfdcd0000 ... sha256(e410275b51...) + OK
## Checking atf-4 0xfdcc9000 ... sha256(990c53fc01...) + OK
## Checking atf-5 0x00066000 ... sha256(315a4195a9...) + OK
Jumping to U-Boot(0x00a00000) via ARM Trusted Firmware(0x00040000)
Total: 244.986 ms

INFO:    Preloader serial: 2
NOTICE:  BL31: v2.3():v2.3-181-gc9a647cae:cl
NOTICE:  BL31: Built : 10:55:41, Oct 18 2021
INFO:    GICv3 without legacy support detected.
INFO:    ARM GICv3 driver initialized in EL3
INFO:    pmu v1 is valid
INFO:    dfs DDR fsp_param[0].freq_mhz= 1056MHz
INFO:    dfs DDR fsp_param[1].freq_mhz= 324MHz
INFO:    dfs DDR fsp_param[2].freq_mhz= 528MHz
INFO:    dfs DDR fsp_param[3].freq_mhz= 780MHz
INFO:    Using opteed sec cpu_context!
INFO:    boot cpu mask: 0
INFO:    BL31: Initializing runtime services
WARNING: No OPTEE provided by BL2 boot loader, Booting device without OPTEE initialization. SMC`s destined for OPTEE will return SMC_UNK
ERROR:   Error initializing runtime service opteed_fast
INFO:    BL31: Preparing for EL3 exit to normal world
INFO:    Entry point address = 0xa00000
INFO:    SPSR = 0x3c9


U-Boot 2017.09-gf48bdfd82a-211223 #aghost (Dec 31 2021 - 21:09:55 +0800)

Model: Radxa ROCK 3 Model A
PreSerial: 2, raw, 0xfe660000
DRAM:  7.7 GiB
Sysmem: init
Relocation Offset: ed349000
Relocation fdt: eb9f6f20 - eb9fecd0
CR: M/C/I
Using default environment

Bootdev: nvme 0
PartType: EFI
No misc partition
boot mode: None
FIT: No boot partition
No resource partition
No resource partition
Failed to load DTB, ret=-19
Failed to get kernel dtb, ret=-19
I2c0 speed: 100000Hz
vsel-gpios- not found! Error: -2
vdd_cpu 1025000 uV
PMIC:  RK8090 (on=0x40, off=0x00)
vdd_logic init 900000 uV
vdd_gpu init 900000 uV
vdd_npu init 900000 uV
of_get_regulator: Get (vccio3-supply) regulator: 0 failed, ret=-19
of_get_regulator: Get (vccio5-supply) regulator: 0 failed, ret=-19
of_get_regulator: Get (vccio7-supply) regulator: 0 failed, ret=-19
io-domain: OK
Model: Radxa ROCK 3 Model A
rockchip_set_ethaddr: vendor_storage_write failed -19
rockchip_set_serialno: could not find efuse/otp device
CLK: (sync kernel. arm: enter 816000 KHz, init 816000 KHz, kernel 0N/A)
  apll 816000 KHz
  dpll 528000 KHz
  gpll 1188000 KHz
  cpll 1000000 KHz
  npll 24000 KHz
  vpll 24000 KHz
  hpll 24000 KHz
  ppll 200000 KHz
  armclk 816000 KHz
  aclk_bus 150000 KHz
  pclk_bus 50000 KHz
  aclk_top_high 300000 KHz
  aclk_top_low 200000 KHz
  hclk_top 150000 KHz
  pclk_top 50000 KHz
  aclk_perimid 300000 KHz
  hclk_perimid 150000 KHz
  pclk_pmu 100000 KHz
No misc partition
Net:   No ethernet found.
Hit key to stop autoboot('CTRL+C'):  0

Device 0: Vendor: 0x14a4 Rev: 1.01     Prod: P02738110277
            Type: Hard Disk
            Capacity: 244198.3 MB = 238.4 GB (500118192 x 512)
... is now current device
Scanning nvme 0:1...
no mmc device at slot 1
no mmc device at slot 0
starting USB...
Bus dwc3@fcc00000: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
Bus dwc3@fd000000: usb maximum-speed not found
Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus dwc3@fcc00000 for devices... 1 USB Device(s) found
scanning bus dwc3@fd000000 for devices... 1 USB Device(s) found
       scanning usb for storage devices... 0 Storage Device(s) found

Device 0: unknown device
No ethernet found.
missing environment variable: pxeuuid
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/00000000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/0000000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/000000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/00000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/0000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/000
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/00
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/0
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default-arm-rockchip
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default-arm
No ethernet found.
missing environment variable: bootfile
Retrieving file: pxelinux.cfg/default
No ethernet found.
Config file not found
No ethernet found.
No ethernet found.
## Booting FIT Image FIT: No boot partition
FIT: No FIT image
Could not find misc partition
ANDROID: reboot reason: "(none)"
optee check api revision fail: -1.0
optee api revision is too low
### ERROR ### Please RESET the board ###

arch/arm/mach-rockchip/boot_rkimg.c:71

#ifdef CONFIG_NVME
	struct udevice *udev;

	pci_init();
	ret = nvme_scan_namespace();
	if (!ret) {
		ret = blk_get_device(IF_TYPE_NVME, 0, &udev);
		if (!ret) {
			devtype = "nvme";
			devnum = "0";
			env_set("devtype", devtype);
			env_set("devnum", devnum);
			goto finish;
		}
	} else {
		printf("Set nvme as boot storage fail ret=%d\n", ret);
	}
#endif

Because of this code, whenever an NVMe storage device is present it is set as the primary boot device, and all other devices are ignored.
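
One possible way to avoid this is to treat a device as the boot device only if it actually holds something bootable, and otherwise fall through to the next one. The following is a minimal host-side sketch of that selection logic, not a patch against `boot_rkimg.c`. `struct boot_candidate`, its fields, and `pick_boot_device()` are hypothetical names made up for illustration; real U-Boot code would need to probe the partition table to fill in `has_boot_image`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of one storage device seen by the bootloader. */
struct boot_candidate {
	const char *devtype;	/* e.g. "nvme", "mmc" */
	int devnum;
	int present;		/* device was detected */
	int has_boot_image;	/* a bootable partition/image was found on it */
};

/*
 * Return the first candidate that is both present and bootable,
 * or NULL if none qualifies. Candidates are tried in array order.
 */
static const struct boot_candidate *
pick_boot_device(const struct boot_candidate *devs, size_t count)
{
	size_t i;

	assert(devs != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (!devs[i].present)
			continue;
		if (!devs[i].has_boot_image) {
			/* Present but empty: do not stop here, keep looking. */
			printf("%s %d: no boot image, trying next device\n",
			       devs[i].devtype, devs[i].devnum);
			continue;
		}
		return &devs[i];
	}
	return NULL;
}
```

With an empty NVMe drive listed first and a bootable eMMC listed second, this returns the eMMC entry instead of stopping at the NVMe drive.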


Another problem: the NVMe drive starts failing after a period of heavy I/O.
The drive model is a Gloway Basic 256G.

[34318.094045] hrtimer: interrupt took 163920 ns
[36813.913054] print_req_error: I/O error, dev nvme0n1, sector 304912
[36813.913123] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 12060059 (offset 0 size 0 starting block 38115)
[36813.913145] Buffer I/O error on device nvme0n1p1, logical block 37858
[36813.913185] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 12060059 (offset 12464128 size 12288 starting block 38118)
[36813.913205] Buffer I/O error on device nvme0n1p1, logical block 37859
[36813.913221] Buffer I/O error on device nvme0n1p1, logical block 37860
[36813.913239] Buffer I/O error on device nvme0n1p1, logical block 37861
[36816.330765] JBD2: Detected IO errors while flushing file data on nvme0n1p1-8
[36844.635920] print_req_error: I/O error, dev nvme0n1, sector 304936
[36844.635989] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 12060059 (offset 0 size 0 starting block 38118)
[36844.636012] Buffer I/O error on device nvme0n1p1, logical block 37861
[36844.636055] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 12060059 (offset 12476416 size 12288 starting block 38121)
[36844.636088] Buffer I/O error on device nvme0n1p1, logical block 37862
[36844.636108] Buffer I/O error on device nvme0n1p1, logical block 37863
[36844.636121] Buffer I/O error on device nvme0n1p1, logical block 37864
[36848.041887] JBD2: Detected IO errors while flushing file data on nvme0n1p1-8
[36859.861811] print_req_error: I/O error, dev nvme0n1, sector 248710600
[36859.861961] Aborting journal on device nvme0n1p1-8.
[36859.866317] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:61: Detected aborted journal
[36859.866388] EXT4-fs (nvme0n1p1): Remounting filesystem read-only



Even with a PM9A1, this small load saturates the NVMe drive's I/O outright. This has to be a bug in the kernel's NVMe I/O path.

Please make sure your SSD is not the problem. When we have seen this before, it was usually a faulty SSD controller (damaged by static discharge) or bad blocks on the SSD. Test the same SSD on a PC first.

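Already checked. One Samsung PM9A1, one Plextor M8Se, and four Gloway Basic drives, and all four boards behave the same way.
The drives work fine on a PC.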

The kernel is at commit id 526c758e05e0.

I reran the tests, and it looks like docker-ce is what is incompatible with this NVMe drive???

The shutdown/reboot bug is definitely real, though.

With the NPU disabled in the device tree, the problem no longer occurs on shutdown or reboot.
It may be a circuit issue or a kernel issue.

These two issues have been fixed. Please use the latest ROCK 3A Ubuntu/Debian image.

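OK, I'll give it a try.
Also, I have now confirmed that there is a problem with the PCIe driver: at close to zero load, the NVMe SSD randomly produces I/O errors. The PM9A1, the Plextor M8Se, and the Gloway Basic all show this.
Using NVMe as the etcd storage for a k8s cluster, it can usually be reproduced within 8 hours.
Using NVMe as the storage for a MinIO cluster, it can usually be reproduced within 3 days.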

Once the I/O errors occur, the filesystem is corrupted. After unmounting, it can be repaired and mounted again.

The k8s cluster has 3 master nodes.
The MinIO cluster has 4 nodes.
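
For anyone who wants to try reproducing this without a full cluster, here is a rough sketch of a small write-and-verify loop that calls `fsync()` after every block, loosely imitating an fsync-heavy workload such as etcd. It is only an approximation of that load, and `write_verify()` and its parameters are names made up for this example. Note that the read-back is likely served from the page cache, so on a failing drive the error would most probably show up as a failed `write()` or `fsync()` rather than as a data mismatch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `iterations` blocks of `block_size` bytes to `path`, calling fsync()
 * after each write, then read every block back and compare.
 * Returns 0 if all data matches, 1 on a data mismatch, -1 on an I/O error.
 * The test file is removed before returning.
 */
static int write_verify(const char *path, size_t block_size, unsigned iterations)
{
	unsigned char *wbuf, *rbuf;
	unsigned i;
	int fd, rc = 0;

	assert(path != NULL && block_size > 0);

	wbuf = malloc(block_size);
	rbuf = malloc(block_size);
	fd = (wbuf && rbuf) ? open(path, O_RDWR | O_CREAT | O_TRUNC, 0600) : -1;
	if (fd < 0) {
		free(wbuf);
		free(rbuf);
		return -1;
	}

	/* Write phase: one block per iteration, flushed to the device each time. */
	for (i = 0; i < iterations && rc == 0; i++) {
		memset(wbuf, (int)(i & 0xff), block_size);
		if (write(fd, wbuf, block_size) != (ssize_t)block_size ||
		    fsync(fd) != 0)
			rc = -1;	/* short write or I/O error */
	}

	/* Read phase: rewind and compare each block with its expected pattern. */
	if (rc == 0 && lseek(fd, 0, SEEK_SET) != 0)
		rc = -1;
	for (i = 0; i < iterations && rc == 0; i++) {
		memset(wbuf, (int)(i & 0xff), block_size);
		if (read(fd, rbuf, block_size) != (ssize_t)block_size)
			rc = -1;
		else if (memcmp(wbuf, rbuf, block_size) != 0)
			rc = 1;
	}

	close(fd);
	unlink(path);
	free(wbuf);
	free(rbuf);
	return rc;
}
```

Pointing `path` at a file on the NVMe filesystem and running this in a long loop, while watching `dmesg` for `print_req_error` lines, would be one way to check whether the errors appear outside of Docker, k8s, or MinIO.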

Hello, may I ask where the U-Boot source code for the ROCK 3A can be downloaded?