
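1. Introduction

During system development on an Allwinner platform, the system crashed intermittently. This article walks through how the crash was analyzed.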

2. Stack trace

[ 27.892505] init: open path: /dev/bus/usb/005/002
[ 29.580872] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 29.589952] pgd = c0004000
[ 29.593117] [00000004] *pgd=00000000
[ 29.597211] sunxi oops: enable sdcard JTAG interface
[ 29.602744] sunxi oops: cpu frequency: 1008 MHz
[ 29.602963] sunxi oops: ddr frequency: 576 MHz
[ 29.602963] sunxi oops: gpu frequency: 576 MHz
[ 29.602963] sunxi oops: cpu temperature: 66
[ 29.602963] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 29.602963] Modules linked in: 8188eu dummy_acc snd_usb_audio snd_usbmidi_lib snd_hwdep gpio_sunxi sunxi_ir_rx sunxi_sndspdif sndspdif sunxi_spdma sunxi_spdif uvcvideo videobuf_dma_contig videobuf_core mali(O) nand(O) [last unloaded: 8188eu]
[ 29.602963] CPU: 0 Tainted: G W O (3.4.39 #1)
[ 29.602963] PC is at cpufreq_governor_interactive+0x2cc/0x5c8
[ 29.602963] LR is at cpufreq_governor_interactive+0x2cc/0x5c8
[ 29.602963] pc : [<c0406568>] lr : [<c0406568>] psr: 600f0013
[ 29.602963] sp : e6221d90 ip : e6221d90 fp : e6221dd4
[ 29.602963] r10: 00000000 r9 : 00000000 r8 : 00000002
[ 29.602963] r7 : ffffffff r6 : 00000000 r5 : 00000000 r4 : e61c6680
[ 29.602963] r3 : c08d9378 r2 : 012bb000 r1 : 00000000 r0 : c09389b8
[ 29.602963] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 29.602963] Control: 10c5387d Table: 648a006a DAC: 00000015
…..
…..
[ 29.602963] [<c0406568>] (cpufreq_governor_interactive+0x2cc/0x5c8) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[ 29.602963] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354)
[ 29.602963] [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354) from [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88)
[ 29.602963] [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88) from [<c004e05c>] (notifier_call_chain+0x48/0x78)
[ 29.602963] [<c004e05c>] (notifier_call_chain+0x48/0x78) from [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c)
[ 29.602963] [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c) from [<c0029ca8>] (__cpu_notify+0x3c/0x58)
[ 30.600053] fence timeout on [e1b1ddc0] after 1000ms
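
The number after "Oops:" in the trace is the fault status value that the 32-bit ARM kernel passes along when it dies. As a rough illustration of how to read it, the sketch below decodes that value under the assumption of the ARMv7 short-descriptor fault status layout (status in bits [3:0] plus bit 10, write flag in bit 11). The helper names are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, assuming the ARMv7 short-descriptor FSR layout. */
#define FSR_FS_LOW_MASK 0x0fu       /* FS[3:0]: low bits of the fault status */
#define FSR_FS4         (1u << 10)  /* FS[4]: high bit of the fault status */
#define FSR_WNR         (1u << 11)  /* set when the faulting access was a write */

/* Combine FS[4] and FS[3:0] into a single 5-bit status code. */
static unsigned int fsr_status(uint32_t fsr)
{
    return (fsr & FSR_FS_LOW_MASK) | ((fsr & FSR_FS4) ? 0x10u : 0u);
}

/* True if the access that faulted was a write, false for a read. */
static bool fsr_is_write(uint32_t fsr)
{
    return (fsr & FSR_WNR) != 0;
}

/* Name only the two translation-fault codes; everything else is "other". */
static const char *fsr_describe(uint32_t fsr)
{
    switch (fsr_status(fsr)) {
    case 0x05: return "translation fault (section)";
    case 0x07: return "translation fault (page)";
    default:   return "other";
    }
}
```

Under these assumptions, the value 5 from the log decodes to a read that hit a first-level (section) translation fault, which agrees with the "*pgd=00000000" line and with a load through a near-NULL pointer.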
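3. Problem analysis

Preliminary conclusion: the crash was caused by an access to an invalid address. After some investigation, two suspects remained:
- the data loaded from memory was corrupted
- the cpufreq driver has a bug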

4. Detailed analysis

Step 1: Disassemble cpufreq_governor_interactive and find the instruction at offset +0x2cc.

c040596c <cpufreq_governor_interactive>:
c040596c: e1a0c00d mov ip, sp
c0405970: e92ddff0 push {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
c0405974: e24cb004 sub fp, ip, #4
c0405978: e24dd01c sub sp, sp, #28
c040597c: e92d4000 push {lr}
…..
…..
c0405c34: eb001352 bl c040a984 <cpufreq_frequency_get_table>
c0405c38: e5953004 ldr r3, [r5, #4] // +0x2cc: the crash is here, so r5 holds an invalid address

In the register dump r5 is 00000000, and r5 + 4 equals the faulting virtual address 00000004.

Step 2: Check cpufreq_frequency_get_table.

c040a984 <cpufreq_frequency_get_table>: // no access to r5 found
c040a984: e1a0c00d mov ip, sp
c040a988: e92dd800 push {fp, ip, lr, pc}
c040a98c: e24cb004 sub fp, ip, #4
c040a990: e92d4000 push {lr}
c040a994: ebf00e1b bl c000e208 <__gnu_mcount_nc>
c040a998: e59f200c ldr r2, [pc, #12] ; c040a9ac <cpufreq_frequency_get_table+0x28>
c040a99c: e59f300c ldr r3, [pc, #12] ; c040a9b0 <cpufreq_frequency_get_table+0x2c>
c040a9a0: e7922100 ldr r2, [r2, r0, lsl #2]
c040a9a4: e7930002 ldr r0, [r3, r2]
c040a9a8: e89da800 ldm sp, {fp, sp, pc}
c040a9ac: c08fcc70 .word 0xc08fcc70
c040a9b0: c08d8378 .word 0xc08d8378

Step 3: Looking further up the disassembly, r5 is assigned very early in the function. Use gdb to map the faulting address to a source line:

(gdb) b*0xc0405c38
Note: breakpoint 1 also set at pc 0xc0405c38.
Breakpoint 2 at 0xc0405c38: file drivers/cpufreq/cpufreq_interactive.c, line 1561.

C source:

1560 freq_table = cpufreq_frequency_get_table(policy->cpu);
1561 if (!tunables->hispeed_freq) { // the fault is here
1562 #if defined(CONFIG_ARCH_SUN9IW1P1)

Confirm once more that the disassembly matches the source:

c0405c34: eb001352 bl c040a984 <cpufreq_frequency_get_table>
c0405c38: e5953004 ldr r3, [r5, #4]
c0405c3c: e59f82dc ldr r8, [pc, #732] ; c0405f20 <cpufreq_governor_interactive+0x5b4>
c0405c40: e50b0040 str r0, [fp, #-64] ; 0x40
c0405c44: e3530000 cmp r3, #0 // this does match: if (!tunables->hispeed_freq) {

Step 4: According to the stack trace, the call path is __cpufreq_remove_dev() -> __cpufreq_governor(START) -> cpufreq_governor_interactive(START). The failing code:

static int cpufreq_governor_interactive(struct cpufreq_policy *policy, unsigned int event)
{
	if (have_governor_per_policy()) // tunables is assigned here, then the switch jumps to the CPUFREQ_GOV_START case
		tunables = policy->governor_data;
	else
		tunables = common_tunables; // this branch is taken when the CPUs have only one class
	WARN_ON(!tunables && (event != CPUFREQ_GOV_POLICY_INIT));
	switch (event) {
	case CPUFREQ_GOV_START:
		mutex_lock(&gov_lock);
		freq_table = cpufreq_frequency_get_table(policy->cpu);
		if (!tunables->hispeed_freq) { // crashes here
		}
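
The link between the faulting address 00000004 and the source line can be made concrete with a small user-space sketch. It assumes that hispeed_freq is the second 4-byte member of the tunables structure, which is what the ldr r3, [r5, #4] instruction suggests. The struct below is a mock with only two members, and member_address is a hypothetical helper, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the first two members only. The real structure in
 * cpufreq_interactive.c has more fields; this layout is an assumption
 * inferred from the "#4" offset in the faulting load. */
struct mock_tunables {
    int usage_count;            /* assumed to sit at offset 0 */
    unsigned int hispeed_freq;  /* assumed to sit at offset 4 */
};

/* Address the CPU accesses for ptr->member, given the raw pointer value
 * and the member's offset inside the structure. */
static uintptr_t member_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}
```

With a NULL base, member_address(0, offsetof(struct mock_tunables, hispeed_freq)) yields 4, the same value the kernel reported as the faulting virtual address. A fault at a small non-zero address like this usually points to a member access through a NULL structure pointer rather than to random memory corruption.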

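Suspicions:
1. The variable was modified.
2. The cpufreq interactive governor had already exited.

5. Investigation

(1) Add a print of the event type (governor start, stop, exit, and so on) to the interactive governor callback, then watch the sequence of events after the system has booted and entered the factory test.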

[ 34.181471] **[interactive] event = 2
[ 34.185587] **common_tunables addr: e1a50840
[ 34.191101] **tunables addr: e1a50840
[ 34.195176] **[interactive] event = 5
[ 34.199333] **common_tunables addr: e1a50840
[ 34.204289] **tunables addr: e1a50840

Result: the interactive governor does go through an exit.
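
To read the raw event numbers in that log, a lookup like the following can help. This is only a sketch: it assumes the BSP uses the same numbering as the CPUFREQ_GOV_* constants in kernels that carry the POLICY_INIT and POLICY_EXIT events (START = 1, STOP = 2, LIMITS = 3, POLICY_INIT = 4, POLICY_EXIT = 5). The enum and function names are hypothetical, and the values should be checked against include/linux/cpufreq.h in the actual tree.

```c
/* Assumed numbering of governor events; verify against the BSP header. */
enum gov_event {
    GOV_EVENT_START       = 1,
    GOV_EVENT_STOP        = 2,
    GOV_EVENT_LIMITS      = 3,
    GOV_EVENT_POLICY_INIT = 4,
    GOV_EVENT_POLICY_EXIT = 5
};

/* Translate an event number from the debug print into a readable name. */
static const char *gov_event_name(unsigned int event)
{
    switch (event) {
    case GOV_EVENT_START:       return "START";
    case GOV_EVENT_STOP:        return "STOP";
    case GOV_EVENT_LIMITS:      return "LIMITS";
    case GOV_EVENT_POLICY_INIT: return "POLICY_INIT";
    case GOV_EVENT_POLICY_EXIT: return "POLICY_EXIT";
    default:                    return "UNKNOWN";
    }
}
```

If the assumption holds, "event = 2" followed by "event = 5" reads as STOP followed by POLICY_EXIT, which matches the conclusion that the governor was stopped and then torn down.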

(2) Add a dump_stack() call on the interactive governor's exit path.

[ 34.181471] **[interactive] event = 2
[ 34.185587] **common_tunables addr: e1a50840
[ 34.191101] **tunables addr: e1a50840
[ 34.195176] **[interactive] event = 5
[ 34.199333] **common_tunables addr: e1a50840
[ 34.204289] **tunables addr: e1a50840
[ 34.208502] [<c00169fc>] (unwind_backtrace+0x0/0xec) from [<c05f8190>] (dump_stack+0x20/0x24)
[ 34.218108] [<c05f8190>] (dump_stack+0x20/0x24) from [<c0406524>] (cpufreq_governor_interactive+0x270/0x658)
[ 34.229727] [<c0406524>] (cpufreq_governor_interactive+0x270/0x658) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[ 34.241874] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0)
[ 34.253429] [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0) from [<c0401808>] (store_scaling_governor+0x13c/0x17c)
[ 34.265173] [<c0401808>] (store_scaling_governor+0x13c/0x17c) from [<c04005bc>] (store+0x6c/0x94)
[ 34.275511] [<c04005bc>] (store+0x6c/0x94) from [<c0166a58>] (sysfs_write_file+0x118/0x14c)
[ 34.285090] [<c0166a58>] (sysfs_write_file+0x118/0x14c) from [<c010f2a4>] (vfs_write+0xc4/0x140)

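Result: an upper-layer application switches the cpufreq governor through sysfs, and this operation is what triggers the crash.

6. Summary

Full picture of the problem:

(1) The stack traces show that the problem is caused by an upper-layer application changing the frequency-scaling governor on its own. In short, after boot the governor is switched from performance to interactive, and cpu hotplug is then enabled so that the system scales frequency and brings cores online on demand.

(2) There is a special case. If upper-layer software switches from interactive to another governor through the sysfs node, the interactive governor is first stopped and then exited, which releases its resources: for example, common_tunables = NULL and policy->governor_data = NULL. The policy itself (the policy set that every CPU has) is not released. If the switch has not yet completed at that moment (policy->governor == interactive), bringing a core online or taking it offline uses the interactive governor's resources directly, and the NULL pointer then crashes the system.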
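
The sequence described in (2) can be modeled in user space. The sketch below is not the driver and not a fix adopted by the vendor. It is a simplified, single-threaded model with hypothetical names, showing one possible defensive guard in which the START handler refuses to run once the tunables have been released, instead of dereferencing NULL. The event numbers and the 1008000 placeholder frequency are assumptions for illustration.

```c
#include <stddef.h>

/* Assumed event numbers, used only inside this model. */
enum { EV_START = 1, EV_STOP = 2, EV_POLICY_INIT = 4, EV_POLICY_EXIT = 5 };

struct model_tunables {
    int usage_count;
    unsigned int hispeed_freq;
};

static struct model_tunables model_storage;
static struct model_tunables *common_tunables; /* NULL until POLICY_INIT */

/* Returns 0 on success, -1 when START arrives after the tunables are gone. */
static int model_governor(unsigned int event)
{
    struct model_tunables *tunables = common_tunables;

    switch (event) {
    case EV_POLICY_INIT:
        model_storage.usage_count = 1;
        model_storage.hispeed_freq = 0;
        common_tunables = &model_storage;
        return 0;
    case EV_POLICY_EXIT:
        common_tunables = NULL;        /* resources released on exit */
        return 0;
    case EV_START:
        if (!tunables)
            return -1;                 /* guard: do not dereference NULL */
        if (!tunables->hispeed_freq)
            tunables->hispeed_freq = 1008000; /* placeholder value */
        return 0;
    default:                           /* STOP and others: nothing to do here */
        return 0;
    }
}
```

Calling model_governor with POLICY_INIT, START, STOP, POLICY_EXIT and then START again returns -1 on the last call rather than faulting. In a real kernel a NULL check alone would probably not be enough, because governor switching and CPU hotplug run concurrently and would also need to be serialized.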
