This post is long; I suggest navigating it by the table of contents.

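Problem 1

Problem

Symptoms

Running my graduation project (a multi-process Python program) triggered kernel panics at unpredictable times. The mouse and keyboard stopped responding, and even the Magic SysRq key did nothing.

Error log

This log was very hard to capture: it is the only one that survived out of N crashes.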

22:14:50 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
22:14:34 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [CompositorTileW:4417]
22:13:55 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
21:34:56 pulseaudio: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)
21:34:42 spice-vdagent: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
21:34:25 gnome-session-b: Unrecoverable failure in required component org.gnome.Shell.desktop

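Reasoning

My first reaction was that I had a bug somewhere in my program. After checking it repeatedly, I was confident the program was fine.

Then I remembered that Ryzen can produce segmentation faults under certain conditions. But the articles I read at the time all said this happens during compilation, and I was running Python, so I ruled it out the first time.

I checked the program once more and still found nothing wrong.

Later I tried updating the BIOS, and the crashes became even more frequent.

At that point I thought of the CPU again, so I wrote a small script to test it.

Test program

The idea is simple: 16 processes access 16 shared memory locations at the same time, reading and writing continuously, to see whether anything goes wrong.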

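from multiprocessing import Array, Process


def f(i, states_list):
    """Hammer the shared array forever: read it, then overwrite every slot."""
    n = 0
    while True:
        if i == 0:
            # Only worker 0 prints, as a heartbeat showing the loop is alive.
            n += 1
            print(n)
        _ = sum(states_list[1:])  # read traffic; the result is not used
        for j in range(16):
            states_list[j] = i  # write traffic


def test4():
    # Shared array of 16 C ints, visible to every worker process.
    states_list = Array('i', [0] * 16)

    envs_p = [Process(target=f, args=(i, states_list)) for i in range(16)]

    for p in envs_p:
        p.start()

    # The workers never return, so this blocks until interrupted (Ctrl+C).
    for p in envs_p:
        p.join()


if __name__ == '__main__':
    test4()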

Log

22:25:05 kernel: page->mem_cgroup:0000000800000000
22:25:05 kernel: page dumped because: page still charged to cgroup
22:25:05 kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000800000000
22:25:05 kernel: flags: 0x17ffffc0000000()
22:25:05 kernel: page:ffffd6f50a7eb940 count:0 mapcount:0 mapping:0000000000000000 index:0x0
22:25:05 kernel: BUG: Bad page state in process conky  pfn:29fae5
20:28:12 kernel: page dumped because: nonzero _count
20:28:12 kernel: raw: 0000000000000001 0000000000000000 00000080ffffffff 0000000000000000
20:28:12 kernel: flags: 0x17ffffc0000000()
20:28:12 kernel: page:ffffd6f50cd9c080 count:128 mapcount:0 mapping:0000000000000000 index:0x1
20:28:12 kernel: BUG: Bad page state in process python3  pfn:336702
19:28:31 systemd-rfkill: Failed to open device rfkill0: No such device
14:51:54 pulseaudio: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
14:51:29 spice-vdagent: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
22:50:11 gnome-session-b: Unrecoverable failure in required component org.gnome.Shell.desktop

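When I ran the test program and saw this log, I was fairly sure the CPU was at fault, even though the machine did not actually freeze this time.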
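
The by-eye check of the log can be sketched as a small filter. This is only an illustration: the function name `find_kernel_bugs` and the pattern list are hypothetical, built from the kernel messages quoted in this post, and are not a complete list of what the kernel can print.

```python
import re

# Signatures copied from the kernel messages quoted in this post.
# Assumption: this short list is illustrative, not exhaustive.
PATTERNS = [
    r"BUG: Bad page state",
    r"BUG: unable to handle kernel",
    r"BUG: soft lockup",
]
KERNEL_BUG = re.compile("|".join(PATTERNS))


def find_kernel_bugs(lines):
    """Return the log lines that contain one of the signatures above."""
    return [line for line in lines if KERNEL_BUG.search(line)]


if __name__ == "__main__":
    sample = [
        "22:25:05 kernel: BUG: Bad page state in process conky  pfn:29fae5",
        "19:28:31 systemd-rfkill: Failed to open device rfkill0: No such device",
        "20:28:12 kernel: BUG: Bad page state in process python3  pfn:336702",
    ]
    for hit in find_kernel_bugs(sample):
        print(hit)
```

Feeding it the saved log lines keeps the kernel BUG entries and drops the unrelated pulseaudio, spice-vdagent and gnome-session noise.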

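Later in this post there is a separate crash caused by memory timings, but that is not what happened here: after overclocking the memory I always ran several hours of stress testing.

Fix

Replace the CPU. Luckily I had bought a boxed CPU from JD.com; had it been Taobao, I would have been out of luck...

I did not go through an RMA to replace the CPU. I filed a repair request directly with JD.com.

The repair was really slow. JD.com passed the CPU on to the manufacturer soon after receiving it, but the manufacturer still had not fixed it after a month (such poor efficiency). In the end, after my repeated nagging, JD.com simply gave me a brand-new one. Thanks, JD.com.

Testing after the CPU replacement

Same test program as above, but this time none of those errors appeared.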

Log

12:41:19 pulseaudio: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
12:40:55 spice-vdagent: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
14:36:10 gnome-session-b: Unrecoverable failure in required component org.gnome.Shell.desktop

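Cause

Unknown

Problem 2

Problem

Symptoms

When I tried to record 4K video with OBS, on both Linux and Windows, the screen went black and the machine rebooted.

Error log

Linux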

12:56:11 kernel: Fixing recursive fault but reboot is needed!
12:56:11 kernel: #PF error: [normal kernel read fault]
12:56:11 kernel: BUG: unable to handle kernel paging request at ffffa0140e155a58
12:55:56 kernel: rcu: 	 (t=15000 jiffies g=184845 q=110453)
12:55:51 kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [gdbus:23003]
12:55:37 kernel: BUG: Bad page state in process kernel_oops  pfn:26b8fb
12:55:23 kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [gdbus:23003]
12:25:43 pulseaudio: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
12:25:41 sudo: pam_unix(sudo:auth): auth could not identify password for [zx]
12:25:19 spice-vdagent: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
20:24:00 gnome-session-b: Unrecoverable failure in required component org.gnome.Shell.desktop
20:23:59 kernel: PKCS#7 signature not signed with a trusted key
20:23:58 kernel: Couldn't get size: 0x800000000000000e
20:23:58 kernel: MODSIGN: Couldn't get UEFI db list
20:23:58 kernel: Couldn't get size: 0x800000000000000e

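Windows

Windows Kernel-Power 41 (Event ID 41)

Reasoning

My first guess was an overheating motherboard or an underpowered PSU. A combined CPU, FPU and GPU stress test ruled that out.

During a memory test the machine black-screened and rebooted, which pinned the problem on the memory.

But the memory settings were exactly the same as before. After thinking for a while, I dropped the memory to 2133 CL16 and tested again. It passed.

At that point I knew it was a memory timing problem.

Fix

Changed the memory setting from 3000 CL14 to 2933 CL14.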
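
For a rough sense of what these settings mean on paper, CAS latency in nanoseconds can be estimated as CL × 2000 / data rate, since DDR memory transfers twice per clock cycle. The sketch below assumes that 3000, 2933 and 2133 are data rates in MT/s. `cas_latency_ns` is a hypothetical helper name. It covers only the CAS part, not the end-to-end latency a benchmark reports.

```python
def cas_latency_ns(data_rate_mts, cl):
    """Estimate CAS latency in nanoseconds.

    DDR transfers twice per clock, so the clock in MHz is data_rate_mts / 2
    and one cycle lasts 2000 / data_rate_mts nanoseconds.
    """
    return cl * 2000.0 / data_rate_mts


if __name__ == "__main__":
    # The three settings mentioned in this post (assumed to be MT/s and CL).
    for rate, cl in [(3000, 14), (2933, 14), (2133, 16)]:
        print(f"{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
    # prints:
    # 3000 CL14: 9.33 ns
    # 2933 CL14: 9.55 ns
    # 2133 CL16: 15.00 ns
```

By this estimate, going from 3000 CL14 to 2933 CL14 costs only about 0.2 ns of CAS latency. Measured latency also depends on the other timings and on the memory controller, so it can move by a different amount.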

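Cause

Different CPUs of the same model differ in how well they tolerate a given memory configuration.

PS

After I backed the memory settings off, latency went up by 2 ns and memory copy got slower, which is to be expected. But memory read and write speeds actually went up. WTF, measurement error?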

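Closing notes

First-generation Ryzen has all sorts of small problems, but it changed the whole market and pushed multi-core performance of consumer CPUs up quickly.

Later I ran some overclocking tests and found that the replacement CPU is much better than the original: at default voltage it passes a half-hour P95 test at 3700 MHz.

Finally

AMD YES!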