I have seen multiple times how HDDs fail, both in others' servers and in my own computers. They usually develop unreadable sectors, and the rest of the data can be recovered under Linux by copying it to another HDD with tools like ddrescue. But I have never seen any SSD failures myself before. Until today.
Well, this case is not that different.
Today, my desktop computer failed to boot, with some error messages from systemd about failing to start services. I thought that it might be a one-off error and rebooted it, only to find out that the root partition (XFS on LUKS on /dev/sda2) failed to mount. The error in dmesg told me to run xfs_repair, which I did not do initially.
It did mount with the ro,norecovery options, but I rebooted the system afterwards instead of copying the files somewhere immediately. It was a stupid move, and a lesson for the future.
Then I ran xfs_repair, but it complained a lot about I/O errors, and afterwards, the filesystem was no longer mountable even with the ro,norecovery options.
The majority of sectors are still readable, so, as of now, xfs_repair still looks like a valid recovery strategy. I will update the blog post if it isn't. Even if it isn't, only a day of work is lost.
Update 1: it worked, but I had to run xfs_repair twice, and the node_modules directory from one project ended up in lost+found. So apparently nothing important was lost.
And here is what smartctl in my rescue system says about the drive.
$ sudo smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.4-arch2-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Indilinx Barefoot 3 based SSDs
Device Model:     OCZ-VECTOR
Serial Number:    OCZ-Z5CB4KC20X0ZG7F8
LU WWN Device Id: 5 e83a97 27d603391
Firmware Version: 3.0
User Capacity:    512 110 190 592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb 9 15:09:11 2022 +05
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Runtime_Bad_Block       0x0000   033   033   000    Old_age   Offline      -       33
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       31212
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       5821
171 Avail_OP_Block_Count    0x0000   080   080   000    Old_age   Offline      -       72682832
174 Pwr_Cycle_Ct_Unplanned  0x0000   100   100   000    Old_age   Offline      -       406
195 Total_Prog_Failures     0x0000   100   100   000    Old_age   Offline      -       0
196 Total_Erase_Failures    0x0000   100   100   000    Old_age   Offline      -       0
197 Total_Unc_Read_Failures 0x0000   100   100   000    Old_age   Offline      -       33
208 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       317
210 SATA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       60
224 In_Warranty             0x0000   100   100   000    Old_age   Offline      -       0
233 Remaining_Lifetime_Perc 0x0000   090   090   000    Old_age   Offline      -       90
241 Host_Writes_GiB         0x0000   100   100   000    Old_age   Offline      -       53191
242 Host_Reads_GiB          0x0000   100   100   000    Old_age   Offline      -       34251
249 Total_NAND_Prog_Ct_GiB  0x0000   100   100   000    Old_age   Offline      -       7808432714

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
So, despite the I/O errors, the drive still considers itself healthy (SMART overall-health self-assessment test result: PASSED). What a liar.
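Since the overall verdict cannot be trusted here, one option is to look at the raw error counters directly. Below is a minimal Python sketch of that idea, not a definitive health check. The sample rows are copied from the smartctl output above. The watch list, the helper names (`parse_attributes`, `suspicious`) and the rule "a non-zero raw value is worth a look" are my own assumptions for illustration. Attribute names and their meaning differ between vendors, and a non-zero SATA_CRC_Error_Count may point at the cable rather than the drive.

```python
import re

# Attribute rows copied from the smartctl output of the failing OCZ-VECTOR.
SAMPLE = """\
  5 Runtime_Bad_Block       0x0000   033   033   000    Old_age   Offline      -       33
195 Total_Prog_Failures     0x0000   100   100   000    Old_age   Offline      -       0
196 Total_Erase_Failures    0x0000   100   100   000    Old_age   Offline      -       0
197 Total_Unc_Read_Failures 0x0000   100   100   000    Old_age   Offline      -       33
210 SATA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       60
"""

# Hypothetical watch list (an assumption, not a vendor specification):
# counters that one would hope to see at zero on a healthy drive.
WATCHED = {
    "Runtime_Bad_Block",
    "Total_Prog_Failures",
    "Total_Erase_Failures",
    "Total_Unc_Read_Failures",
    "SATA_CRC_Error_Count",
}

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute name -> raw value for every row that looks like a table row."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(9))
    return attrs


def suspicious(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: raw for name, raw in attrs.items() if name in watched and raw > 0}


if __name__ == "__main__":
    for name, raw in sorted(suspicious(parse_attributes(SAMPLE)).items()):
        print(f"{name}: raw value {raw}")
```

On this drive, the sketch flags Runtime_Bad_Block, Total_Unc_Read_Failures and SATA_CRC_Error_Count, even though the self-assessment says PASSED. To use it on a live system, the text could come from `smartctl -A /dev/sda` instead of the embedded sample.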
Update 2: after a reboot, the system does not detect the drive at all, even in the BIOS. So I was lucky to be able to copy all the data just in time.