Kernel – 4.11-rc1 and BTRFS – a Warning..?
I encountered catastrophic corruption of a btrfs-formatted system disk partition, shortly after updating a test system to Kernel 4.11-rc1.
To put this in context, this may be an isolated incident, and I have not seen problems – so far – mentioned elsewhere.. But.. there are quite a lot of BTRFS changes in 4.11, and – after the usual tests – I confirmed that system memory was OK (running memtest86+ several times..) and the disk was OK (running smartmontools utilities etc. several times). The disk is only one year old, in any case (WD – 2GB)..
Unfortunately, I did not have a serial console attached to this particular system at the time, and so had to photograph the screen at various times, and then re-type the details below from those photos (!)..
The scenario:
System disk (/dev/sda2) is formatted btrfs..
Updated to Kernel 4.11-rc1.
Within 24 hours, the system froze, and on rebooting, I got a critical btrfs error:
...................
Mounting /sysroot....
BTRFS critical (device sda2): corrupt node, bad key order: block=368346054656, root=1, slot=192
BTRFS critical (device sda2): corrupt node, bad key order: block=368346054656, root=1, slot=192
BTRFS error (device sda2): failed to read block groups: -5
BTRFS error (device sda2): open_ctree failed
[FAILED] Failed to mount /sysroot
See 'systemctl status sysroot.mount' for details
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
......................
Starting Emergency Shell...
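For anyone comparing messages like these across reboots, the fields can be pulled out with a few lines of Python. This is only a minimal sketch: the regular expressions assume the exact wording transcribed above (other kernel versions may phrase things differently), and the function names are made up for illustration. The one firm fact it relies on is that the trailing -5 is a negated errno value, and errno 5 on Linux is EIO (I/O error).

```python
import errno
import re

# Matches the "corrupt node" message exactly as transcribed in this post.
# Assumption: other kernel versions may word this message differently.
CORRUPT_NODE = re.compile(
    r"BTRFS critical \(device (?P<device>\w+)\): corrupt node, bad key order: "
    r"block=(?P<block>\d+), root=(?P<root>\d+), slot=(?P<slot>\d+)"
)
# Matches "BTRFS error" lines that end in a negative errno, e.g. ": -5".
ERRNO_LINE = re.compile(r"BTRFS error \(device \w+\): (?P<what>.+): -(?P<code>\d+)$")


def parse_corrupt_node(line):
    """Return the fields of a 'corrupt node' line as a dict, or None."""
    match = CORRUPT_NODE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("block", "root", "slot"):
        fields[key] = int(fields[key])
    return fields


def decode_errno(line):
    """Return (description, symbolic errno name) for a 'BTRFS error' line, or None."""
    match = ERRNO_LINE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("what"), errno.errorcode.get(code, "unknown")


log = [
    "BTRFS critical (device sda2): corrupt node, bad key order: block=368346054656, root=1, slot=192",
    "BTRFS error (device sda2): failed to read block groups: -5",
]
print(parse_corrupt_node(log[0]))
print(decode_errno(log[1]))
```

Keeping the parsed values from each failed mount makes it easy to see whether the reported block, root or slot changes after a repair attempt.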
Ran:
:/# btrfs check /dev/sda2
Checking filesystem on /dev/sda2
UUID xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxx
checking extents
bad block 33488896
Errors found in extent allocation tree or chunk allocation
^C
:/# btrfs rescue chunk-recover /dev/sda2
Scanning: 66126696448 in dev0
Then got scrolling screens full of:
...........
Deleting bad dir index [537997,96,160] root 5
Deleting bad dir index [537997,96,168] root 5
Deleting bad dir index [537997,96,170] root 5
Deleting bad dir index [537997,96,180] root 5
Deleting bad dir index [537997,96,186] root 5
Deleting bad dir index [537997,96,190] root 5
Deleting bad dir index [537997,96,198] root 5
................ continuing...
Then more scrolling screens full of:
...............
Trying to rebuild inode:1658592
root 5 inode 1658592 error 2001, no inode item, link count wrong
unresolved ref dir 198531 index 0 namelen 18 name gtk-indent-ltr.png filetype 7 errors 6, no dir index, no inode ref
Trying to rebuild inode:1658593
root 5 inode 1658593 error 2001, no inode item, link count wrong
unresolved ref dir 198531 index 0 namelen 18 name gtk-indent-rtl.png filetype 7 errors 6, no dir index, no inode ref
............... continuing....
Then more scrolling screens full of:
..............
repairing missing dir index item for inode 30734433
repairing missing dir index item for inode 30734434
repairing missing dir index item for inode 30734435
repairing missing dir index item for inode 30734436
repairing missing dir index item for inode 30734437
repairing missing dir index item for inode 30734438
repairing missing dir index item for inode 30734439
............ continuing...
Then more scrolling screens full of:
..............
Deleting bad dir index [363308,96,63443] root 5
Deleting bad dir index [363396,96,12025] root 5
The following tree block(s) is corrupted in tree 5:
tree block bytenr: 44061900800, level: 1, node key: (365821249776, 168, 45056)
Try to repair the btree for root 5
Btree for root 5 is fixed
Deleting bad dir index [364225,96,3382] root 5
Deleting bad dir index [363308,96,63425] root 5
Deleting bad dir index [363308,96,63427] root 5
Deleting bad dir index [363308,96,63427] root 5
........... continuing ...........
The ‘recovery’ kept running, although it appeared to stall several times. I just left it, and it ran for almost three days in total before finally ending:
..............
Btree for root 5 is fixed
root 5 root dir 256 error
root 5 inode 256 errors 200, dir isize wrong
reset isize for dir 2139040 root 5
reset isize for dir 2139040 root 5
..................
Trying to rebuild inode:33318043
moving file 'lost+found' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 33318043
found 583909376 bytes used err is 1
total csum bytes: 0
total tree bytes: 1884160
total fs tree bytes: 0
total extent tree bytes: 1474560
btree space waste bytes: 758524
file data blocks allocated: 201064448
 referenced 201064448
:/#
:/#
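The closing totals are raw byte counts, which are hard to judge at a glance. The sketch below converts them to binary units. It is only an illustration: the input is the summary exactly as transcribed above, the helper names are invented, and the parsing assumes the simple "label: number" layout seen here rather than any guaranteed output format.

```python
import re

# Summary lines as transcribed at the end of the run described above.
SUMMARY = """\
found 583909376 bytes used err is 1
total csum bytes: 0
total tree bytes: 1884160
total fs tree bytes: 0
total extent tree bytes: 1474560
btree space waste bytes: 758524
file data blocks allocated: 201064448
 referenced 201064448
"""


def human(n_bytes):
    """Format a byte count using binary units (B, KiB, MiB, GiB)."""
    value = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024 or unit == "GiB":
            return f"{value:.2f} {unit}"
        value /= 1024


def parse_summary(text):
    """Map each 'label: number' line to an int, plus the 'found N bytes used' line.

    Lines without a colon (such as the 'referenced' continuation) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        found = re.match(r"found (\d+) bytes used", line)
        if found:
            stats["bytes used"] = int(found.group(1))
            continue
        labelled = re.match(r"(.+?):\s*(\d+)$", line)
        if labelled:
            stats[labelled.group(1)] = int(labelled.group(2))
    return stats


stats = parse_summary(SUMMARY)
for label, value in stats.items():
    print(f"{label:28} {human(value)}")
```

Read this way, the run reports roughly 557 MiB in use and about 192 MiB of allocated file data.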
After all that, I rebooted:
...................
Mounting /sysroot....
BTRFS critical (device sda2): corrupt node, bad key order: block=44061900800, root=1, slot=192
BTRFS critical (device sda2): corrupt node, bad key order: block=44061900800, root=1, slot=192
BTRFS error (device sda2): failed to read block groups: -5
BTRFS error (device sda2): open_ctree failed
[FAILED] Failed to mount /sysroot
See 'systemctl status sysroot.mount' for details
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
....................
Entering emergency mode..............
So… The btrfs recovery was a waste of three days, and did not fix the problem.. I will now have to re-create the data from another source.. Fortunately this was ‘just’ the system disk partition and no irreplaceable recent user data was lost, apart from /root…
I should mention as well, that all other btrfs-formatted partitions on other systems, running 4.10.x and earlier, have been error-free, and reliable. And… I have not tried 4.11-rc2 with BTRFS, yet..
Robert Gadsdon. March 14, 2017.
After decades of no problems running Linux, except for the long-ago xfs filesystem-eating bug, I just had something very similar happen with ext4 on an NVMe drive on Ubuntu 4.10.0-20, which likely contains backported 4.11 ‘fixes’. I suspect it’s probably not isolated to btrfs.
BTW – Thanks for your VMware patches!