
1. Log analysis

[ 3537.282130] PC is at do_page_fault+0x40/0x2e0
[ 3537.282130] LR is at do_translation_fault+0x5c/0xd4
[ 3537.282130] pc : [<ffffffc000095704>] lr : [<ffffffc000095a00>] pstate: 800001c5
[ 3537.282130] sp : ffffffc027b38130
[ 3537.282130] x29: ffffffc027b38130 x28: ffffffc027bdc000
[ 3537.282130] x27: ffffffc0009fa0e9 x26: 0000000000000000
[ 3537.282130] x25: 0000000096000005 x24: 0000000000000025
[ 3537.282130] x23: 0000000000000000 x22: ffffffc027b38390
[ 3537.282130] x21: 00000000000002b0 x20: 00000000000002b0
[ 3537.282130] x19: ffffffc027b38390 x18: 000000000000001e
[ 3537.282130] x17: 00000000000101d0 x16: ffffffc0111dccf4
[ 3537.282130] x15: ffffffc0111dcc04 x14: 0000000000000003
[ 3537.282130] x13: 000000004437411e x12: ffffffc000822000
[ 3537.282130] x11: 0000000000000006 x10: 0000000000000007
[ 3537.282130] x9 : 000000000000000e x8 : 00125bbb859b6f00
[ 3537.282130] x7 : 0000000000000012 x6 : ffffffc000cf77d0
[ 3537.282130] x5 : ffffffc00079e3d8 x4 : ffffffc00079e3d8
[ 3537.282130] x3 : ffffffc0000959a4 x2 : ffffffc027b38390
[ 3537.282130] x1 : 0000000096000005 x0 : 00000000800001c5

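Function prologue (stack push) of do_page_fault:

ffffffc0000956c4 <do_page_fault>:
ffffffc0000956c4: a9a87bfd stp x29, x30, [sp,#-384]!
ffffffc0000956c8: 910003fd mov x29, sp

sp and x29 at the crash site:

sp : ffffffc027b38130
x29: ffffffc027b38130
x30 (lr): ffffffc000095a00

x30 holds the caller's return address (LR). The stp stores x30 at [sp-384+8], which is address ffffffc027b38138 once sp has been updated, and the memory at ffffffc027b38138 contains ffffffc000095a00. Working backwards from SP, the LR saved on the stack matches the LR at the moment of the crash, which indicates that the CPU data is sane.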
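The address arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the pre-index writeback semantics of stp, using the sp value from the log; the helper name stp_pre_index is hypothetical and not part of any tool.

```python
MASK64 = (1 << 64) - 1

def stp_pre_index(sp_before, imm):
    """Model `stp x29, x30, [sp, #imm]!`: sp is updated first (pre-index),
    then x29 is stored at [sp] and x30 at [sp + 8]."""
    sp_after = (sp_before + imm) & MASK64
    x29_slot = sp_after
    x30_slot = (sp_after + 8) & MASK64
    return sp_after, x29_slot, x30_slot

# sp reported in the crash log, i.e. the value after the prologue ran.
SP_AT_CRASH = 0xFFFFFFC027B38130

# Assumption: nothing else moved sp between the prologue and the crash.
sp_before = SP_AT_CRASH + 384
sp_after, x29_slot, x30_slot = stp_pre_index(sp_before, -384)

print("sp after prologue:", hex(sp_after))
print("saved x30 (LR) slot:", hex(x30_slot))
```

The slot printed for x30 is the address whose content is compared against the lr value from the log.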

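2. DS5-level analysis

Connect DS5 to the cores in the order cpu0 -> cpu1 -> cpu2 -> cpu3, stop each online CPU from DS5, and load vmlinux for each of them. After that, the stack of every CPU can be inspected. To dump the current thread information of a CPU:

info stack

[Figure: H64 el1_entry exception debugging analysis]

The stack shows that the crash happened because the CPU accessed an illegal address, which raised an el1_sync exception. The exception handler determined that the cause was a data abort in EL1 and branched into do_mem_abort to handle the page fault. In the do_page_fault stage, the illegal address was found to have been accessed from kernel space, which resulted in a panic.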
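The "data abort in EL1" classification can be cross-checked against the syndrome value 0x96000005 that appears in x1 and x25 of the log. The sketch below decodes the ESR fields as laid out in the Arm architecture (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]); the helper name decode_esr and the two small lookup tables are illustrative and cover only the cases relevant here.

```python
def decode_esr(esr):
    """Split an AArch64 ESR_ELx value into its top-level fields."""
    iss = esr & 0x1FFFFFF
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,    # data aborts: 0 = read, 1 = write
        "dfsc": iss & 0x3F,         # data fault status code
    }

# Partial tables, only the entries needed for this crash.
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort without a change in exception level",
}
DFSC_NAMES = {
    0b000100: "translation fault, level 0",
    0b000101: "translation fault, level 1",
    0b000110: "translation fault, level 2",
    0b000111: "translation fault, level 3",
}

fields = decode_esr(0x96000005)
print(hex(fields["ec"]), EC_NAMES.get(fields["ec"], "other"))
print(bin(fields["dfsc"]), DFSC_NAMES.get(fields["dfsc"], "other"))
print("access:", "write" if fields["wnr"] else "read")
```

The EC value 0x25 also matches x24 = 0x25 in the register dump, and the read access is consistent with the fault being taken on a load.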

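The crash site has been identical across multiple occurrences, and the addr argument passed in is fairly random each time. The current suspicion is that a pointer passed from 32-bit user space to the 64-bit kernel is corrupted, so the CPU faults when it accesses that address in kernel space. The main difficulty is that DS5 can only capture the crash flow after the el1_sync exception; the SPSR and SP of the CPU before the exception have to be derived from the assembly.

el1_sync
kernel_entry el=1:
    sp = sp - (288-240)
    sp = sp - (15*16)
    x21 = sp + 288
    x22 = el1 lr
    x23 = el1 spsr
    store lr at [sp + 240] // LR
    store x21 at [sp + 240 + 8]
    store x22 at [sp + 256] // PC
    store x23 at [sp + 256 + 8]
el1_da:
    x2 = sp
do_mem_abort:
    x29 -> [sp-176]
    x30 -> [sp-176+8]
    sp = sp - 176
    x29 = sp
do_translation_fault:
    x29 -> [sp-48]
    x30 -> [sp-48+8]
    sp = sp - 48
do_page_fault:
    x29 -> [sp-384]
    x30 -> [sp-384+8]
    sp = sp - 384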

Because the crash site from sections 1 to 3 had already been destroyed, its memory could not be read. After the problem was reproduced again, the following valid data was captured:

#0 arch_counter_get_cntvct() at arch_timer.h:153
#1 __delay(cycles = 24000) at delay.c:31
#2 __const_udelay(xloops = <Value currently has no location>) at delay.c:42
#3 panic(fmt = <Value currently has no location>) at panic.c:187
#4 die(str = <Value currently has no location>, regs = (struct pt_regs*) 0xFFFFFFC0297FC050, err = -1778384891) at traps.c:247
#5 __do_kernel_fault(mm = (struct mm_struct*) 0xFFFFFFC029BF8680, addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:102
#6 do_translation_fault(addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:362
#7 do_mem_abort(addr = 18446743833205608120, esr = 2516582405, regs = (struct pt_regs*) 0xFFFFFFC0297FC050) at fault.c:459
#8 [el1_sync+0xB0]
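DS5 prints addr, esr and err in decimal, which makes them hard to compare with the hexadecimal register dumps. A small sketch for converting them follows; the helper names to_u64 and to_u32 are hypothetical, and the input values are copied from the backtrace above.

```python
def to_u64(value):
    """Reinterpret an integer as an unsigned 64-bit quantity."""
    return value & ((1 << 64) - 1)

def to_u32(value):
    """Reinterpret an integer (possibly negative) as unsigned 32-bit."""
    return value & 0xFFFFFFFF

addr_hex = "0x%016X" % to_u64(18446743833205608120)  # addr in frames #5 to #7
esr_hex = "0x%08X" % to_u32(2516582405)              # esr in frames #5 to #7
err_hex = "0x%08X" % to_u32(-1778384891)             # err in frame #4 (signed int)

print("addr =", addr_hex)
print("esr  =", esr_hex)
print("err  =", err_hex)
```

In hexadecimal, esr and err turn out to be the same 32-bit pattern, and addr is a kernel-range address that can be searched for in the register dumps.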

(1) #11: try_to_wake_up

In #10, the sp change and the push of x30 are as follows:

#11-x29 -> #11-sp - 80
#11-x30 -> #11-sp - 80 + 8
#10-sp = #11-sp - 80 = 0xFFFFFFC0297FC170
#11-x29 = 0xFFFFFFC0297FC1C0 (captured with DS5)
#11-x30 = 0xFFFFFFC0000CEC7C (captured with DS5)
#11-SP = 0xFFFFFFC0297FC1C0
LR = X30 = 0xFFFFFFC0000CEC7C

The corresponding assembly is:

ffffffc0000cea74 <try_to_wake_up>:
… … … …
ffffffc0000cec78: 97ffecd8 bl ffffffc0000c9fd8 <ttwu_stat>
-->ffffffc0000cec7c: 14000020 b ffffffc0000cecfc <try_to_wake_up+0x288>
… … … …

x19 register: 0xFFFFFFC012DD3440

(2) #10: ttwu_stat

#10-cpsr = #9-spsr = 0x00000000800001C5
M[4:0] = 0b00101: AArch64, EL1h exception mode
M[0] = 0b1: SP_EL1 is used as SP
#10-sp = #9-sp = 0xFFFFFFC0297FC170

Code location derived from #9-lr 0xFFFFFFC0000CA014:

ffffffc0000c9fd8 <ttwu_stat>:
ffffffc0000c9fd8: a9bb7bfd stp x29, x30, [sp,#-80]!
ffffffc0000c9fdc: 910003fd mov x29, sp
ffffffc0000c9fe0: a90153f3 stp x19, x20, [sp,#16]
ffffffc0000c9fe4: a9025bf5 stp x21, x22, [sp,#32]
ffffffc0000c9fe8: a90363f7 stp x23, x24, [sp,#48]
ffffffc0000c9fec: f90023f9 str x25, [sp,#64]
ffffffc0000c9ff0: 90006656 adrp x22, ffffffc000d91000 <__key.22563>
ffffffc0000c9ff4: aa0003f3 mov x19, x0
ffffffc0000c9ff8: 9102e2d6 add x22, x22, #0xb8
ffffffc0000c9ffc: aa1e03e0 mov x0, x30
ffffffc0000ca000: 2a0103f8 mov w24, w1
ffffffc0000ca004: 2a0203f7 mov w23, w2
ffffffc0000ca008: 97ff185a bl ffffffc000090170 <_mcount>
ffffffc0000ca00c: b00054b5 adrp x21, ffffffc000b5f000 <cpu_worker_pools+0x440>
ffffffc0000ca010: 940a6f4c bl ffffffc000365d40 <debug_smp_processor_id>
-->ffffffc0000ca014: f8605ad4 ldr x20, [x22,w0,uxtw #3]
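The SPSR interpretation above (AArch64, EL1h) can be checked by extracting the individual bit fields. This is a minimal sketch based on the architectural AArch64 SPSR layout; decode_spsr is a hypothetical helper name.

```python
def decode_spsr(spsr):
    """Extract the commonly inspected fields of an AArch64 SPSR_ELx value."""
    m = spsr & 0x1F
    return {
        "nzcv": (spsr >> 28) & 0xF,   # condition flags N, Z, C, V
        "daif": (spsr >> 6) & 0xF,    # D, A, I, F interrupt mask bits
        "aarch32": (m >> 4) & 0x1,    # M[4]: 0 = AArch64, 1 = AArch32
        "el": (m >> 2) & 0x3,         # M[3:2]: exception level
        "sp_elx": m & 0x1,            # M[0]: 1 = SP_ELx ("h"), 0 = SP_EL0 ("t")
    }

state = decode_spsr(0x800001C5)
mode = "EL%d%s" % (state["el"], "h" if state["sp_elx"] else "t")
print("mode:", mode, "| AArch32:", bool(state["aarch32"]))
print("NZCV = %s, DAIF = %s" % (format(state["nzcv"], "04b"), format(state["daif"], "04b")))
```

For this value, the A, I and F mask bits are set and D is clear, and only the N flag is set.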

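In EL1 exception mode, the X0 register passed to do_mem_abort along el1_sync -> el1_da is loaded as follows:

mrs X0, far_el1 // EL1 FAR, the fault address register

The fault address in X0 is 0x199999940015B4AC; routine testing shows that this value is very random. The current suspicion is that the CPU hit a bad address built from its registers while executing ldr x20, [x22,w0,uxtw #3]. After the exception is taken, lr points to the instruction that triggered it.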

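1). Check the x22 register:

The caller's x22 saved in the #10 stack frame is stored at [0xFFFFFFC0297FC170+32+8] = [0xFFFFFFC0297FC198]; DS5 reads x22: 0x00000000 00000000 there. The X22 register saved on the stack in #9, as read by DS5, is 0xFFFFFFC000d910b8. First, derive the value from the x22 data saved in the #10 stack together with the code:

adrp x22, ffffffc000d91000 <__key.22563> // gives x22 = ffffffc000d91000
add x22, x22, #0xb8 // gives x22 = ffffffc000d910b8

The derived x22 is ffffffc000d910b8, which matches the value saved in #9-x22.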
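The adrp/add derivation can also be checked directly from the instruction encodings in the ttwu_stat listing, without trusting the disassembler's annotation. The sketch below decodes the ADRP and ADD (immediate) encodings according to the A64 instruction format; decode_adrp and decode_add_imm are hypothetical helper names, and the ADD decoder assumes the shift bit is 0, which holds for this encoding.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Sign-extend a `bits`-wide two's complement value."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def decode_adrp(insn, pc):
    """Return (rd, target page) for an A64 ADRP instruction located at pc."""
    if (insn & 0x9F000000) != 0x90000000:
        raise ValueError("not an ADRP encoding")
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    offset = sign_extend((immhi << 2) | immlo, 21) << 12
    return insn & 0x1F, ((pc & ~0xFFF) + offset) & MASK64

def decode_add_imm(insn):
    """Return (rd, rn, imm12) for a 64-bit ADD (immediate) with shift bit 0."""
    if (insn & 0xFFC00000) != 0x91000000:
        raise ValueError("not a 64-bit ADD (immediate) with LSL #0")
    return insn & 0x1F, (insn >> 5) & 0x1F, (insn >> 10) & 0xFFF

rd, page = decode_adrp(0x90006656, 0xFFFFFFC0000C9FF0)
add_rd, add_rn, imm12 = decode_add_imm(0x9102E2D6)
x22 = (page + imm12) & MASK64
print("x%d = %s" % (rd, hex(x22)))
```

The same decode_adrp helper can be applied to the second adrp in the listing (b00054b5 at ffffffc0000ca00c) to confirm the x21 page address.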

2). Check x22, w0, uxtw #3

x22 = 0xffffffc000d910b8
offset = ((unsigned long)w0) << 3
x22 + offset = 0x199999940015B4AC ?
Working backwards: offset = 0x199999D3FF3CA3F4 ? w0 = offset >> 3 = 0x333333A7FE7947E

CPU register state at #8: el1_sync:

PC 0xFFFFFFC000083C30
SP 0xFFFFFFC0297FC050
W0 0x00001317 // abnormal data
W1 0xCBD6EEA0 W2 0x0000000C W3 0xCBD701B6 W4 0x00000001
W5 0x0035EEBC W6 0x00CD2E21 W7 0x2064656C W8 0x20706F74
W9 0x7F7F7F7F W10 0xFEFEFEFF W11 0x7F7F7F7F W12 0x01010101
W13 0x00000038 W14 0xFFFFFFFE W15 0x00000000 W16 0x001E1B30
W17 0x00000000 W18 0x00000000 W19 0x00005DC0 W20 0x001DC004
W21 0x00000001 W22 0x001DC068 W23 0x00000056 W24 0x96000005
W25 0x00D91000 W26 0x00B5F000 W27 0x009FA0E9 W28 0x297FC000
W29 0x297FBE10 W30 0x00353AEC
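The question marks in the back-calculation can be made concrete by modelling the address computation of ldr x20, [x22,w0,uxtw #3]: the 32-bit w0 is zero-extended and shifted left by 3 before being added to x22. The sketch below is only a cross-check of numbers quoted in this article, not a root-cause statement; ldr_uxtw_address and index_for are hypothetical helper names. The candidate addresses are the FAR value quoted above, the addr value from the DS5 backtrace (18446743833205608120), and the saved X0 slot value 0x00000000FFFFFFC0 from the kernel_entry stack dump.

```python
MASK64 = (1 << 64) - 1

def ldr_uxtw_address(base, w_index, shift=3):
    """Effective address of `ldr xt, [xn, wm, uxtw #shift]`."""
    return (base + ((w_index & 0xFFFFFFFF) << shift)) & MASK64

def index_for(base, address, shift=3):
    """Return the 32-bit index that yields `address`, or None if unreachable."""
    delta = (address - base) & MASK64
    if delta % (1 << shift) or (delta >> shift) > 0xFFFFFFFF:
        return None
    return delta >> shift

X22 = 0xFFFFFFC000D910B8
FAR_QUOTED = 0x199999940015B4AC       # FAR value quoted in the text
ADDR_BACKTRACE = 18446743833205608120  # addr in frames #5 to #7

print("highest reachable address:", hex(ldr_uxtw_address(X22, 0xFFFFFFFF)))
print("index for quoted FAR:", index_for(X22, FAR_QUOTED))
idx = index_for(X22, ADDR_BACKTRACE)
print("index for backtrace addr:", hex(idx) if idx is not None else None)
```

With this x22, the instruction can only reach addresses up to 0xFFFFFFC800D910B0, so the quoted FAR value is not reachable from any 32-bit w0, while the backtrace addr corresponds to an index that equals the low 32 bits of the saved X0 slot. One possible reading, stated here as an assumption, is that the quoted FAR value and the backtrace come from different captures.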

(3) Deriving the stack data saved in EL1 mode from the #8 data and the el1_sync code flow

In EL1 mode:

#9-sp = #8-sp + (15*16) + (288-240) = #8-sp + 288 = 0xFFFFFFC0297FC170
From the code, #8-x21 = #8-sp + 288; at the crash site #8-x21 = 0xFFFFFFC0297FC170, so the code derivation matches the captured CPU data. The derived #9-sp also matches the captured CPU state, so #9-sp is correct.

#9-lr = [#8-sp+240] = [0xFFFFFFC0297FC050+240] = [0xFFFFFFC0297FC140] = 0xFFFFFFC0000CA014 (DS5 memory dump)

#9-el1 lr = [#8-sp+256] = [0xFFFFFFC0297FC050+256] = [0xFFFFFFC0297FC150] = 0xFFFFFFC0000CA014 (DS5 memory dump)
In the code, x22 = el1 lr; the captured #9-x22 register is 0xFFFFFFC0000CA014, which matches el1 lr.

#9-spsr = [0xFFFFFFC0297FC050+256+8] = [0xFFFFFFC0297FC158] = 0x00000000800001C5 (DS5 memory dump)
From the code, x23 = el1 spsr; the captured x23 is 0x00000000800001C5, so the code derivation and the CPU state agree.
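The slot addresses used in this step all follow from the 288-byte frame that kernel_entry builds at sp. A short sketch that recomputes them is shown below; the offsets are the ones from the kernel_entry pseudocode in this article, while the helper name and the dictionary keys are illustrative.

```python
FRAME_SIZE = 288  # bytes reserved by kernel_entry, per the pseudocode above

def entry_frame_slots(sp):
    """Addresses of the special slots in the exception frame starting at sp."""
    return {
        "lr": sp + 240,          # x30 of the interrupted context
        "saved_sp": sp + 248,    # x21 = sp + 288, i.e. sp before the exception
        "elr": sp + 256,         # x22 = el1 lr (PC of the faulting instruction)
        "spsr": sp + 264,        # x23 = el1 spsr
        "sp_before": sp + FRAME_SIZE,
    }

slots = entry_frame_slots(0xFFFFFFC0297FC050)  # #8-sp captured with DS5
for name, address in slots.items():
    print("%-10s %s" % (name, hex(address)))
```

The printed addresses are the ones to read with the DS5 memory dump when repeating this check on another capture.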

In EL1h mode, kernel_entry saves registers X0~X29 as they were before the exception. Assembly:

sp = sp - (288-240) // = 0xFFFFFFC0297FC140
push x28, x29 // stp \xreg1,\xreg2,[sp,#-16]!
push x26, x27
push x24, x25
push x22, x23
push x20, x21
push x18, x19
push x16, x17
push x14, x15
push x12, x13
push x10, x11
push x8, x9
push x6, x7
push x4, x5
push x2, x3
push x0, x1

DS5 was used to read the stack memory from [#9-sp - 48] to [#9-sp - 48 - 240], i.e. from [0xFFFFFFC0297FC140] down to [0xFFFFFFC0297FC050]. Working backwards through the kernel_entry assembly gives the position of each register on the stack:

X28 -> [sp-16]: 0xFFFFFFC0297FC130 = 0xFFFFFFC0297FC000
X29 -> [sp-8]: 0xFFFFFFC0297FC138 = 0xFFFFFFC0297FC170

The resulting register layout is:

EL1N:0xFFFFFFC0297FC050:
X0 0x00000000FFFFFFC0
X1 0x0000000000000000
X2 0x0000000000000000
X3 0x0000000000000200
X4 0x0000000000000000
X5 0x0000000000000044
X6 0xFFFFFFC000CDB33C
EL1N:0xFFFFFFC0297FC088:
X7 0x0000000000000000
X8 0xFFFFFFC000CDB33C
X9 0x7F7F7F7F7F7F7F7F
X10 0x67531F534F4C4444
X11 0x7F7F7F7F7F7F7F7F
X12 0x0101010101010101
X13 0x0000000000000028
EL1N:0xFFFFFFC0297FC0C0:
X14 0xFFFFFFFFFFFFFFFF
X15 0x0000000000000000
X16 0xFFFFFFC0001E1B30
X17 0x0000000000000000
X18 0x0000000000000000
X19 0xFFFFFFC012DD3440
X20 0x0000000000000001
EL1N:0xFFFFFFC0297FC0F8:
X21 0xFFFFFFC000B5F000
X22 0xFFFFFFC000D910B8
X23 0x0000000000000000
X24 0x0000000000000000
X25 0xFFFFFFC000D91000
X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9
EL1N:0xFFFFFFC0297FC130:
X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FC170
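Because the registers are pushed in pairs from x28/x29 down to x0/x1, register Xn ends up at frame base + 8*n. The sketch below uses that relation to locate any saved register in the dump; saved_reg_address is a hypothetical helper, and the base address is the #8-sp value from this article.

```python
FRAME_BASE = 0xFFFFFFC0297FC050  # #8-sp, lowest address of the saved registers

def saved_reg_address(n, base=FRAME_BASE):
    """Stack address of saved register Xn (n = 0..29) in the kernel_entry frame."""
    if not 0 <= n <= 29:
        raise ValueError("only x0..x29 are pushed in pairs by kernel_entry")
    return base + 8 * n

# Spot-check the registers that start each row of the DS5 memory dump,
# plus the two slots derived by hand in the text.
for n in (0, 7, 14, 21, 28, 29):
    print("X%-2d -> %s" % (n, hex(saved_reg_address(n))))
```

The addresses printed for X0, X7, X14, X21 and X28 coincide with the EL1N row labels of the dump, which is a quick way to confirm that the register-to-slot mapping has not slipped by one entry.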

(4)

el1_sync -> kernel_entry el=1:
    sp = sp - (288-240)
    sp = sp - (15*16) // push registers X0-X29
    x21 = sp + 288
    x22 = el1 lr
    x23 = el1 spsr
    store lr at [sp + 240] // LR
    store x21 at [sp + 240 + 8]
    store x22 at [sp + 256] // PC
    store x23 at [sp + 256 + 8]
el1_da:
    x2 = sp

Captured stack data:

X19 0xFFFFFFC012DD3440
X20 0x0000000000000001
X21 0xFFFFFFC0297FC170
X22 0xFFFFFFC0000CA014
X23 0x00000000800001C5
X24 0x0000000000000025
X25 0xFFFFFFC000D91000
X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9
X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FC170
PC 0xFFFFFFC000083C30
SP 0xFFFFFFC0297FC050

Derivation from the code: #8-sp = #7-sp + 176 = 0xFFFFFFC0297FBFA0 + 176 = 0xFFFFFFC0297FC050. The theoretical #8-sp derived back from #7-sp matches the captured CPU SP, and #8-sp is consistent with #8-X21 (X21 = sp + 288), so the data is normal.

(5)

data_bad -> do_mem_abort:
    x29 -> [sp-176]
    x30 -> [sp-176+8]
    sp = sp - 176
    x29 = sp

Captured stack data:

X19 0x0000000096000005
X20 0xFFFFFFC800D90EB8
X21 0xFFFFFFC000B70E90
X22 0xFFFFFFC0297FC050
X23 0x00000000800001C5
X24 0x0000000000000025
X25 0xFFFFFFC000D91000
X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9
X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FBFA0
PC 0xFFFFFFC000081238
SP 0xFFFFFFC0297FBFA0

Derivation from the code: #7-sp = #6-sp + 48 = 0xFFFFFFC0297FBF70 + 48 = 0xFFFFFFC0297FBFA0. The theoretical #7-sp derived from #6-sp matches the captured CPU SP, and #7-sp equals #7-X29, so the data is normal.

(6)

do_mem_abort -> do_translation_fault:
    x29 -> [sp-48]
    x30 -> [sp-48+8]
    sp = sp - 48
    x29 = sp

Captured stack data:

X19 0xFFFFFFC0297FC050
X20 0xFFFFFFC800D90EB8
X21 0x0000000096000005
X22 0xFFFFFFC029BF8680
X23 0x00000000800001C5
X24 0x0000000000000025
X25 0xFFFFFFC000D91000
X26 0xFFFFFFC000B5F000
X27 0xFFFFFFC0009FA0E9
X28 0xFFFFFFC0297FC000
X29 0xFFFFFFC0297FBF70
PC 0xFFFFFFC000095A64
SP 0xFFFFFFC0297FBF70

Conclusion: #6-sp = #5-sp + 48 = 0xFFFFFFC0297FBF40 + 48 = 0xFFFFFFC0297FBF70. The theoretical #6-sp derived back from #5-sp matches the captured stack data, and #6-sp equals #6-X29, so the data is normal.
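The SP checks in steps (3) to (6) all follow the same pattern: add the amount by which sp was lowered to the inner frame's sp and compare the result with the outer frame's captured sp. The sketch below runs that check over the whole chain in one pass. The sp values and deltas are the ones quoted in this article; the variable and function names are illustrative.

```python
# sp captured with DS5 for each frame number used in the article.
CAPTURED_SP = {
    5: 0xFFFFFFC0297FBF40,
    6: 0xFFFFFFC0297FBF70,
    7: 0xFFFFFFC0297FBFA0,
    8: 0xFFFFFFC0297FC050,
    9: 0xFFFFFFC0297FC170,
}

# Bytes by which sp was lowered between frame n+1 and frame n,
# taken from the derivations in steps (3) to (6).
SP_DELTA = {5: 48, 6: 48, 7: 176, 8: 288}

def check_sp_chain(captured, deltas):
    """Return (frame, expected outer sp, captured outer sp, match) per step."""
    results = []
    for frame in sorted(deltas):
        expected = captured[frame] + deltas[frame]
        actual = captured[frame + 1]
        results.append((frame, expected, actual, expected == actual))
    return results

chain = check_sp_chain(CAPTURED_SP, SP_DELTA)
for frame, expected, actual, ok in chain:
    status = "ok" if ok else "MISMATCH"
    print("#%d-sp + %3d = %s vs #%d-sp %s: %s"
          % (frame, SP_DELTA[frame], hex(expected), frame + 1, hex(actual), status))
```

A mismatch at any step would point to the frame where the stack content stops being trustworthy, which narrows down where to look for corruption.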
