Comments

Kernel – 6.6 is out – OK with Latest NVIDIA – Needs VMware 17.5.0 Patchset.. — 1 Comment

  1. Commenting here months later as I’ve finally traced anomalous behavior to its (suspected) source.

    I have a 32GB Dell Precision M4700 laptop that’s been running Fedora for 10+ years; and since I periodically do some Windows-dependent work on it, I’ve had VMW Workstation running on it as well.

    In the past few months, weirdness began – exhaust fans would spin to max when system was idle (sadc would be pegged/hung), logins would timeout, even sysrq commands would be refused. Hard power-off was the only solution.

    Eventually I noticed this only happened after I had spun up a VM for any period of time. Looking at kernel logs, I found that a persistent stall was occurring in one or more of the 8 cores, with rcu_preempt messages appearing every 3 minutes. Usually, the stalled CPU was idle and the NMI backtrace was skipped; but during active use, the backtraces revealed any number of processes could be involved. Also, across reboots the stalled CPU core number would change, leading me away from the idea of a hardware fault.

    I finally got around to looking through the kernel logs from many past boot sessions, and the last one I found that did NOT have the rcu warnings was way back on 8 December 2023 using kernel 6.6.4-100.fc38.x86_64, built on (gcc (GCC) 13.2.1 20231011)

    The next boot session on 19 December 2023 was the first time I saw the behavior, and was 6.6.6-100.fc38.x86_64, built on the same gcc.

    It has persisted across the upgrades from F38->39 and VMW Workstation 16->17.5->17.5.1, including clean reinstalls of VMWS 17.5.0 and 17.5.1

    During boot, a message similar to this is logged:

    Dec 19 00:53:03 aries kernel: microcode: updated early: 0x20 -> 0x21, date = 2019-02-13
    Dec 19 00:53:03 aries kernel: Linux version 6.6.6-100.fc38.x86_64 (mockbuild@7a4a2a4d6aa34c638aa842bb8b9e2def) (gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), GNU ld version >
    ———–snip—————
    Dec 19 00:57:39 aries kernel: /dev/vmmon[5190]: PTSC: initialized at 3189631000 Hz using TSC, TSCs are synchronized.
    Dec 19 00:57:40 aries kernel: /dev/vmmon[5190]: Monitor IPI vector: 0
    Dec 19 00:57:40 aries kernel: /dev/vmmon[5190]: HV IPI vector: 0
    Dec 19 00:57:40 aries kernel: ————[ cut here ]————
    Dec 19 00:57:40 aries kernel: WARNING: CPU: 1 PID: 5190 at kernel/rcu/tree_plugin.h:734 rcu_sched_clock_irq+0xb7e/0x1130
    Dec 19 00:57:40 aries kernel: Modules linked in: vmnet(OE) parport_pc vmmon(OE) snd_seq_dummy snd_hrtimer nf_nat_ftp nf_conntrack_ftp nf_conntrack_netbios_ns nf_conntrack_br>
    Dec 19 00:57:40 aries kernel: videobuf2_common dell_smm_hwmon mac80211 snd_hda_core kvm libarc4 videodev dell_wmi snd_hwdep irqbypass snd_seq snd_seq_device mc rapl snd_pcm>
    Dec 19 00:57:40 aries kernel: Unloaded tainted modules: vmnet(OE):2 vmmon(OE):2 [last unloaded: vmnet(OE)]
    Dec 19 00:57:40 aries kernel: CPU: 1 PID: 5190 Comm: vmware-vmx Tainted: P OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 00:57:40 aries kernel: Hardware name: Dell Inc. Precision M4700/082H3V, BIOS A19 11/30/2018
    Dec 19 00:57:40 aries kernel: RIP: 0010:rcu_sched_clock_irq+0xb7e/0x1130
    Dec 19 00:57:40 aries kernel: Code: 38 08 00 00 85 c0 0f 84 ab f5 ff ff e9 b7 fc ff ff c6 87 39 08 00 00 01 e9 9a f5 ff ff 48 89 ef e8 b7 a8 f3 ff e9 10 ff ff ff 0b e9 >
    Dec 19 00:57:40 aries kernel: RSP: 0018:ffffba058013cdf0 EFLAGS: 00010086
    Dec 19 00:57:40 aries kernel: RAX: ffffa01083c90000 RBX: 0000000000000000 RCX: 00000000000f2ee9
    Dec 19 00:57:40 aries kernel: RDX: 00000000ffffffd1 RSI: ffffffff9a888eb2 RDI: ffffa01083c90000
    Dec 19 00:57:40 aries kernel: RBP: ffffa0178e262280 R08: ffffa0178e261430 R09: 0000000000000000
    Dec 19 00:57:40 aries kernel: R10: 0000000000000000 R11: ffffba058013cff8 R12: ffffa0178e264d00
    Dec 19 00:57:40 aries kernel: R13: ffffba0583e838f8 R14: ffffa0178e264d10 R15: ffffa0178e2647c0
    Dec 19 00:57:40 aries kernel: FS: 00007fee118f2c00(0000) GS:ffffa0178e240000(0000) knlGS:0000000000000000
    Dec 19 00:57:40 aries kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 19 00:57:40 aries kernel: CR2: 00007fee025ab000 CR3: 0000000368ce2003 CR4: 00000000001706e0
    Dec 19 00:57:40 aries kernel: Call Trace:
    Dec 19 00:57:40 aries kernel:
    Dec 19 00:57:40 aries kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Dec 19 00:57:40 aries kernel: ? __warn+0x81/0x130
    Dec 19 00:57:40 aries kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Dec 19 00:57:40 aries kernel: ? report_bug+0x171/0x1a0
    Dec 19 00:57:40 aries kernel: ? handle_bug+0x3c/0x80
    Dec 19 00:57:40 aries kernel: ? exc_invalid_op+0x17/0x70
    Dec 19 00:57:40 aries kernel: ? asm_exc_invalid_op+0x1a/0x20
    Dec 19 00:57:40 aries kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Dec 19 00:57:40 aries kernel: ? __pfx_pvclock_gtod_notify+0x10/0x10 [kvm]
    Dec 19 00:57:40 aries kernel: ? timekeeping_update+0xdd/0x130
    Dec 19 00:57:40 aries kernel: ? timekeeping_advance+0x377/0x590
    Dec 19 00:57:40 aries kernel: update_process_times+0x74/0xb0
    Dec 19 00:57:40 aries kernel: tick_sched_handle+0x21/0x60
    Dec 19 00:57:40 aries kernel: tick_sched_timer+0x6f/0x90
    Dec 19 00:57:40 aries kernel: ? __pfx_tick_sched_timer+0x10/0x10
    Dec 19 00:57:40 aries kernel: __hrtimer_run_queues+0x112/0x2b0
    Dec 19 00:57:40 aries kernel: hrtimer_interrupt+0xf8/0x230
    Dec 19 00:57:40 aries kernel: __sysvec_apic_timer_interrupt+0x50/0x140
    Dec 19 00:57:40 aries kernel: sysvec_apic_timer_interrupt+0x6d/0x90
    Dec 19 00:57:40 aries kernel:
    Dec 19 00:57:40 aries kernel:
    Dec 19 00:57:40 aries kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
    Dec 19 00:57:40 aries kernel: RIP: 0010:__pte_offset_map+0xfa/0x180
    Dec 19 00:57:40 aries kernel: Code: e0 db 48 39 d0 75 45 48 c1 eb 09 81 e3 f8 0f 00 00 48 89 f8 0f 1f 00 48 8b 15 4a 59 62 01 48 85 f6 74 79 48 81 e2 00 00 e0 ff 03 1d >
    Dec 19 00:57:40 aries kernel: RSP: 0018:ffffba0583e839a8 EFLAGS: 00000206
    Dec 19 00:57:40 aries kernel: RAX: 000000010a5af067 RBX: 0000000000000d58 RCX: 000000010a5af067
    Dec 19 00:57:40 aries kernel: RDX: 000ffffffffff000 RSI: 0000000000000000 RDI: 000000010a5af067
    Dec 19 00:57:40 aries kernel: RBP: ffffba0583e839c8 R08: 800000036169b867 R09: ffffa0178e27a4e8
    Dec 19 00:57:40 aries kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffa012e9228090
    Dec 19 00:57:40 aries kernel: R13: ffffba0583e83ac8 R14: 0000000000000000 R15: ffffa01305ceddc0
    Dec 19 00:57:40 aries kernel: ? __pte_offset_map+0x1b/0x180
    Dec 19 00:57:40 aries kernel: __pte_offset_map_lock+0x44/0x110
    Dec 19 00:57:40 aries kernel: do_anonymous_page+0x119/0x3b0
    Dec 19 00:57:40 aries kernel: __handle_mm_fault+0xbe6/0xd90
    Dec 19 00:57:40 aries kernel: handle_mm_fault+0x17f/0x360
    Dec 19 00:57:40 aries kernel: do_user_addr_fault+0x1ed/0x660
    Dec 19 00:57:40 aries kernel: exc_page_fault+0x7f/0x180
    Dec 19 00:57:40 aries kernel: asm_exc_page_fault+0x26/0x30
    Dec 19 00:57:40 aries kernel: RIP: 0010:rep_movs_alternative+0x4a/0x70
    Dec 19 00:57:40 aries kernel: Code: 75 f1 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 48 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 df 83 f9 08 73 e8 eb c9 a4 c3 >
    Dec 19 00:57:40 aries kernel: RSP: 0018:ffffba0583e83c60 EFLAGS: 00010206
    Dec 19 00:57:40 aries kernel: RAX: 00007fee025ab010 RBX: 0000000000001000 RCX: 0000000000000010
    Dec 19 00:57:40 aries kernel: RDX: 0000000000000000 RSI: ffffa0132bf84ff0 RDI: 00007fee025ab000
    Dec 19 00:57:40 aries kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000001820000
    Dec 19 00:57:40 aries kernel: R10: 000000000000000f R11: 0000000001820000 R12: ffffba0583e83e40
    Dec 19 00:57:40 aries kernel: R13: 0000000000001000 R14: ffffa0132bf84000 R15: 0000000000000000
    Dec 19 00:57:40 aries kernel: copyout+0x20/0x30
    Dec 19 00:57:40 aries kernel: _copy_to_iter+0x5e/0x4a0
    Dec 19 00:57:40 aries kernel: copy_page_to_iter+0x8b/0x140
    Dec 19 00:57:40 aries kernel: filemap_read+0x1d3/0x360
    Dec 19 00:57:40 aries kernel: vfs_read+0x201/0x350
    Dec 19 00:57:40 aries kernel: ksys_read+0x6f/0xf0
    Dec 19 00:57:40 aries kernel: do_syscall_64+0x60/0x90
    Dec 19 00:57:40 aries kernel: ? exc_page_fault+0x7f/0x180
    Dec 19 00:57:40 aries kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Dec 19 00:57:40 aries kernel: RIP: 0033:0x7fee11a3d0ea
    Dec 19 00:57:40 aries kernel: Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 79 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 3d 00 >
    Dec 19 00:57:40 aries kernel: RSP: 002b:00007ffcd4118240 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    Dec 19 00:57:40 aries kernel: RAX: ffffffffffffffda RBX: 0000000000553f88 RCX: 00007fee11a3d0ea
    Dec 19 00:57:40 aries kernel: RDX: 0000000000553f88 RSI: 00007fee020aa010 RDI: 000000000000004b
    Dec 19 00:57:40 aries kernel: RBP: 00007ffcd4118260 R08: 0000000000000000 R09: 0000000000000000
    Dec 19 00:57:40 aries kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000553f88
    Dec 19 00:57:40 aries kernel: R13: 0000000000000027 R14: 00007fee020aa010 R15: 0000000000000001
    Dec 19 00:57:40 aries kernel:
    Dec 19 00:57:40 aries kernel: —[ end trace 0000000000000000 ]—
    ————————–

    Launching a guest VM works fine, but the kernel logs this message, and here the CPU stall begins (note in this case it’s CPU 0):

    Dec 19 03:20:22 aries kernel: ————[ cut here ]————
    Dec 19 03:20:22 aries kernel: WARNING: CPU: 0 PID: 5256 at kernel/rcu/tree_exp.h:787 rcu_exp_handler+0x35/0xe0
    Dec 19 03:20:22 aries kernel: Modules linked in: bluetooth vmnet(OE) parport_pc vmmon(OE) snd_seq_dummy snd_hrtimer nf_nat_ftp nf_conntrack_ftp nf_conntrack_netbios_ns nf_co>
    Dec 19 03:20:22 aries kernel: videobuf2_common dell_smm_hwmon mac80211 snd_hda_core kvm libarc4 videodev dell_wmi snd_hwdep irqbypass snd_seq snd_seq_device mc rapl snd_pcm>
    Dec 19 03:20:22 aries kernel: Unloaded tainted modules: vmnet(OE):2 vmmon(OE):2 [last unloaded: vmnet(OE)]
    Dec 19 03:20:22 aries kernel: CPU: 0 PID: 5256 Comm: vmx-vcpu-0 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 03:20:22 aries kernel: Hardware name: Dell Inc. Precision M4700/082H3V, BIOS A19 11/30/2018
    Dec 19 03:20:22 aries kernel: RIP: 0010:rcu_exp_handler+0x35/0xe0
    Dec 19 03:20:22 aries kernel: Code: 55 65 48 8b 2c 25 00 39 03 00 53 8b 85 34 08 00 00 48 c7 c3 40 50 03 00 65 48 03 1d 95 c9 e4 66 4c 8b 63 18 85 c0 74 0d 7f 58 0b 5b >
    Dec 19 03:20:22 aries kernel: RSP: 0018:ffffba0580003f98 EFLAGS: 00010086
    Dec 19 03:20:22 aries kernel: RAX: 00000000fffff742 RBX: ffffa0178e235040 RCX: ffffa0178e375400
    Dec 19 03:20:22 aries kernel: RDX: 0000000000000000 RSI: 00000000282149e6 RDI: 0000000000000000
    Dec 19 03:20:22 aries kernel: RBP: ffffa012b67aa8c0 R08: 0000000000000002 R09: 0000000000000000
    Dec 19 03:20:22 aries kernel: R10: 0000000000000000 R11: ffffba0580003ff8 R12: ffffffff9b348ac0
    Dec 19 03:20:22 aries kernel: R13: ffffffff991cd040 R14: 0000000000000000 R15: 0000000000000000
    Dec 19 03:20:22 aries kernel: FS: 00007fede41fe6c0(0000) GS:ffffa0178e200000(0000) knlGS:0000000000000000
    Dec 19 03:20:22 aries kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 19 03:20:22 aries kernel: CR2: 00007fb3800160c0 CR3: 0000000368ce2001 CR4: 00000000001706f0
    Dec 19 03:20:22 aries kernel: Call Trace:
    Dec 19 03:20:22 aries kernel:
    Dec 19 03:20:22 aries kernel: ? rcu_exp_handler+0x35/0xe0
    Dec 19 03:20:22 aries kernel: ? __warn+0x81/0x130
    Dec 19 03:20:22 aries kernel: ? rcu_exp_handler+0x35/0xe0
    Dec 19 03:20:22 aries kernel: ? report_bug+0x171/0x1a0
    Dec 19 03:20:22 aries kernel: ? handle_bug+0x3c/0x80
    Dec 19 03:20:22 aries kernel: ? exc_invalid_op+0x17/0x70
    Dec 19 03:20:22 aries kernel: ? asm_exc_invalid_op+0x1a/0x20
    Dec 19 03:20:22 aries kernel: ? __pfx_rcu_exp_handler+0x10/0x10
    Dec 19 03:20:22 aries kernel: ? rcu_exp_handler+0x35/0xe0
    Dec 19 03:20:22 aries kernel: __flush_smp_call_function_queue+0x10c/0x410
    Dec 19 03:20:22 aries kernel: __sysvec_call_function_single+0x1c/0xc0
    Dec 19 03:20:22 aries kernel: sysvec_call_function_single+0x6d/0x90
    Dec 19 03:20:22 aries kernel:
    Dec 19 03:20:22 aries kernel:
    Dec 19 03:20:22 aries kernel: asm_sysvec_call_function_single+0x1a/0x20
    Dec 19 03:20:22 aries kernel: RIP: 0010:Task_Switch+0x182b/0x1c20 [vmmon]
    Dec 19 03:20:22 aries kernel: Code: e9 80 f0 ff ff cd 38 e9 79 f0 ff ff cd 29 e9 72 f0 ff ff cd 55 e9 6b f0 ff ff cd 54 e9 64 f0 ff ff cd fc e9 5d f0 ff ff cd fb 56 f0 >
    Dec 19 03:20:22 aries kernel: RSP: 0018:ffffba0580817af0 EFLAGS: 00000046
    Dec 19 03:20:22 aries kernel: RAX: 0000000000000065 RBX: ffffa012b864dc58 RCX: ffffa012b864dce8
    Dec 19 03:20:22 aries kernel: RDX: 00000000000000fb RSI: 0000000000000001 RDI: ffffba05805e3000
    Dec 19 03:20:22 aries kernel: RBP: ffffa012b864dc00 R08: 0000000080050033 R09: ffffffffffffffff
    Dec 19 03:20:22 aries kernel: R10: 00007fede41fe6c0 R11: 0000000000000040 R12: 0000000000000000
    Dec 19 03:20:22 aries kernel: R13: 0000000000000287 R14: ffffba05805e36f0 R15: ffffba05805e3000
    Dec 19 03:20:22 aries kernel: Vmx86_RunVM+0xe1/0x700 [vmmon]
    Dec 19 03:20:22 aries kernel: ? __pfx_pollwake+0x10/0x10
    Dec 19 03:20:22 aries kernel: ? avc_has_extended_perms+0x237/0x520
    Dec 19 03:20:22 aries kernel: LinuxDriver_Ioctl+0xabc/0x1310 [vmmon]
    Dec 19 03:20:22 aries kernel: ? ioctl_has_perm.constprop.0.isra.0+0xda/0x130
    Dec 19 03:20:22 aries kernel: ? __x64_sys_ioctl+0x97/0xd0
    Dec 19 03:20:22 aries kernel: __x64_sys_ioctl+0x97/0xd0
    Dec 19 03:20:22 aries kernel: do_syscall_64+0x60/0x90
    Dec 19 03:20:22 aries kernel: ? syscall_exit_to_user_mode+0x2b/0x40
    Dec 19 03:20:22 aries kernel: ? do_syscall_64+0x6c/0x90
    Dec 19 03:20:22 aries kernel: ? exit_to_user_mode_prepare+0x142/0x1f0
    Dec 19 03:20:22 aries kernel: ? syscall_exit_to_user_mode+0x2b/0x40
    Dec 19 03:20:22 aries kernel: ? do_syscall_64+0x6c/0x90
    Dec 19 03:20:22 aries kernel: ? do_syscall_64+0x6c/0x90
    Dec 19 03:20:22 aries kernel: ? do_syscall_64+0x6c/0x90
    Dec 19 03:20:22 aries kernel: ? do_syscall_64+0x6c/0x90
    Dec 19 03:20:22 aries kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Dec 19 03:20:22 aries kernel: RIP: 0033:0x7fee11a42ecd
    Dec 19 03:20:22 aries kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 c2 3d >
    Dec 19 03:20:22 aries kernel: RSP: 002b:00007fede41fa820 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    Dec 19 03:20:22 aries kernel: RAX: ffffffffffffffda RBX: 000055f50244b8b8 RCX: 00007fee11a42ecd
    Dec 19 03:20:22 aries kernel: RDX: 0000000000000000 RSI: 00000000000007d8 RDI: 000000000000000f
    Dec 19 03:20:22 aries kernel: RBP: 00007fede41fa870 R08: 0000000000000019 R09: 0000000000000000
    Dec 19 03:20:22 aries kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000015a
    Dec 19 03:20:22 aries kernel: R13: 00007fee03ff7388 R14: 00007fee03ff73a0 R15: 00007ffcd4118040
    Dec 19 03:20:22 aries kernel:
    Dec 19 03:20:22 aries kernel: —[ end trace 0000000000000000 ]—
    ————————–

    Sixty seconds later, the first CPU stall warning (CPU 0) is logged:

    Dec 19 03:21:22 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 60065 jiffies s: 925 root: 0x1/.
    Dec 19 03:21:22 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:21:22 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:21:22 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    ————————–

    Distro kernels use 3x the initial timeout for repeated warnings, so every three minutes there is another message:

    Dec 19 03:21:22 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 60065 jiffies s: 925 root: 0x1/.
    Dec 19 03:21:22 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:21:22 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:21:22 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 03:24:22 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 240289 jiffies s: 925 root: 0x1/.
    Dec 19 03:24:22 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:24:22 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:24:22 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 03:27:22 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 420513 jiffies s: 925 root: 0x1/.
    Dec 19 03:27:22 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:27:22 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:27:22 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 03:30:23 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 600737 jiffies s: 925 root: 0x1/.
    Dec 19 03:30:23 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:30:23 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:30:23 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 03:33:23 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 780961 jiffies s: 925 root: 0x1/.
    Dec 19 03:33:23 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:33:23 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:33:23 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    ————————–

    Occasionally, the stalled CPU will be in a non-idle state and actually provide a backtrace:

    Dec 19 03:36:23 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 961185 jiffies s: 925 root: 0x1/.
    Dec 19 03:36:23 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 03:36:23 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 03:36:23 aries kernel: NMI backtrace for cpu 0
    Dec 19 03:36:23 aries kernel: CPU: 0 PID: 6326 Comm: kworker/u16:3 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 03:36:23 aries kernel: Hardware name: Dell Inc. Precision M4700/082H3V, BIOS A19 11/30/2018
    Dec 19 03:36:23 aries kernel: Workqueue: 0x0 (events_unbound)
    Dec 19 03:36:23 aries kernel: RIP: 0010:load_balance+0x9f3/0x1000
    Dec 19 03:36:23 aries kernel: Code: f8 0f 0f 87 ca 01 00 00 85 c9 0f 85 c2 01 00 00 4c 89 ff e8 8f 48 6b 00 84 c0 74 0e 49 8b 17 49 8b 47 08 48 89 42 08 48 89 10 8b 9c >
    Dec 19 03:36:23 aries kernel: RSP: 0018:ffffba058a777c20 EFLAGS: 00000002
    Dec 19 03:36:23 aries kernel: RAX: ffffa0128cb180b8 RBX: ffffa012d362d180 RCX: ffffa0128cb180b8
    Dec 19 03:36:23 aries kernel: RDX: ffffa0178e374c38 RSI: ffffa012d362d238 RDI: ffffa012d362d238
    Dec 19 03:36:23 aries kernel: RBP: 0000000000000000 R08: ffffa012d362d238 R09: ffffffffffffffc0
    Dec 19 03:36:23 aries kernel: R10: 0000000000000002 R11: 0000000000000008 R12: ffffa0178e374140
    Dec 19 03:36:23 aries kernel: R13: ffffba058a777cf0 R14: ffffa0178e374c38 R15: ffffa012d362d238
    Dec 19 03:36:23 aries kernel: FS: 0000000000000000(0000) GS:ffffa0178e200000(0000) knlGS:0000000000000000
    Dec 19 03:36:23 aries kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 19 03:36:23 aries kernel: CR2: 00007f23d7b05000 CR3: 00000007a7222005 CR4: 00000000001706f0
    Dec 19 03:36:23 aries kernel: Call Trace:
    Dec 19 03:36:23 aries kernel:
    Dec 19 03:36:23 aries kernel: ? nmi_cpu_backtrace+0x99/0x110
    Dec 19 03:36:23 aries kernel: ? nmi_cpu_backtrace_handler+0x11/0x20
    Dec 19 03:36:23 aries kernel: ? nmi_handle+0x61/0x150
    Dec 19 03:36:23 aries kernel: ? default_do_nmi+0x40/0x100
    Dec 19 03:36:23 aries kernel: ? exc_nmi+0x139/0x1c0
    Dec 19 03:36:23 aries kernel: ? end_repeat_nmi+0x16/0x67
    Dec 19 03:36:23 aries kernel: ? load_balance+0x9f3/0x1000
    Dec 19 03:36:23 aries kernel: ? load_balance+0x9f3/0x1000
    Dec 19 03:36:23 aries kernel: ? load_balance+0x9f3/0x1000
    Dec 19 03:36:23 aries kernel:
    Dec 19 03:36:23 aries kernel:
    Dec 19 03:36:23 aries kernel: ? sched_clock+0x10/0x30
    Dec 19 03:36:23 aries kernel: newidle_balance+0x1d3/0x430
    Dec 19 03:36:23 aries kernel: pick_next_task_fair+0x41/0x5c0
    Dec 19 03:36:23 aries kernel: __schedule+0x192/0x1410
    Dec 19 03:36:23 aries kernel: schedule+0x5e/0xd0
    Dec 19 03:36:23 aries kernel: worker_thread+0x19d/0x3a0
    Dec 19 03:36:23 aries kernel: ? __pfx_worker_thread+0x10/0x10
    Dec 19 03:36:23 aries kernel: kthread+0xe8/0x120
    Dec 19 03:36:23 aries kernel: ? __pfx_kthread+0x10/0x10
    Dec 19 03:36:23 aries kernel: ret_from_fork+0x34/0x50
    Dec 19 03:36:23 aries kernel: ? __pfx_kthread+0x10/0x10
    Dec 19 03:36:23 aries kernel: ret_from_fork_asm+0x1b/0x30
    Dec 19 03:36:23 aries kernel:
    ————————–

    The backtraces often involve different system-level threads:

    Dec 19 03:45:24 aries kernel: NMI backtrace for cpu 0
    Dec 19 03:45:24 aries kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 04:30:27 aries kernel: NMI backtrace for cpu 0
    Dec 19 04:30:27 aries kernel: CPU: 0 PID: 5476 Comm: kworker/0:0 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 04:42:28 aries kernel: NMI backtrace for cpu 0
    Dec 19 04:42:28 aries kernel: CPU: 0 PID: 7104 Comm: kworker/u16:2 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 05:03:30 aries kernel: NMI backtrace for cpu 0
    Dec 19 05:03:30 aries kernel: CPU: 0 PID: 1536 Comm: rtkit-daemon Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 05:24:31 aries kernel: NMI backtrace for cpu 0
    Dec 19 05:24:31 aries kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 05:33:32 aries kernel: NMI backtrace for cpu 0
    Dec 19 05:33:32 aries kernel: CPU: 0 PID: 2297 Comm: xfwm4 Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    Dec 19 05:36:32 aries kernel: NMI backtrace for cpu 0
    Dec 19 05:36:32 aries kernel: CPU: 0 PID: 7450 Comm: systemd-userwor Tainted: P W OE 6.6.6-100.fc38.x86_64 #1
    ————————–

    Unloading the vmware module doesn’t resolve the stall. Eventually, something critical to system operation will get blocked by the stalled process and the only resolution is a hard power-off. One of the first indicators I had that a problem existed was hearing the exhaust fans spin up to high speed when the system was sitting idle. I found that the ‘sadc’ process was hung, spinning at 100% utilization of one core.

    Dec 19 22:46:49 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 69986977 jiffies s: 925 root: 0x1/.
    Dec 19 22:46:49 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 22:46:49 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 22:46:49 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 22:48:31 aries kernel: sdb1: Can’t mount, would change RO state
    Dec 19 22:48:32 aries kernel: /dev/vmmon[15351]: Module vmmon: unloaded
    Dec 19 22:49:49 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 70167201 jiffies s: 925 root: 0x1/.
    Dec 19 22:49:49 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 22:49:49 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 22:49:49 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 22:52:49 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 70347425 jiffies s: 925 root: 0x1/.
    Dec 19 22:52:49 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 22:52:49 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 22:52:49 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 22:55:50 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 70527649 jiffies s: 925 root: 0x1/.
    Dec 19 22:55:50 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 22:55:50 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 22:55:50 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    Dec 19 22:58:50 aries kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-…D } 70707873 jiffies s: 925 root: 0x1/.
    Dec 19 22:58:50 aries kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Dec 19 22:58:50 aries kernel: Sending NMI from CPU 5 to CPUs 0:
    Dec 19 22:58:50 aries kernel: NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
    ————————–

    Even SysRq commands are unavailable, with the exceptions of “sync” and “remount”:
    Dec 20 00:24:47 aries kernel: sysrq: Emergency Sync
    Dec 20 00:24:52 aries kernel: sysrq: Emergency Remount R/O
    Jan 06 23:18:23 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:18:35 aries kernel: sysrq: Emergency Sync
    Jan 06 23:19:04 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:10 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:13 aries kernel: sysrq: Emergency Sync
    Jan 06 23:19:17 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:24 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:31 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:35 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:43 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:19:48 aries kernel: sysrq: Emergency Sync
    Jan 06 23:19:52 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:20:38 aries kernel: sysrq: This sysrq operation is disabled.
    Jan 06 23:20:41 aries kernel: sysrq: This sysrq operation is disabled.
    ————————–

    Interestingly, when this is taking place, even before it affects any programmatic use of the system, it does affect the login subsystem. Any login operation, local or remote, including using sudo from an active terminal session, results in significant delays (30+ seconds) or even timing out.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.