Wednesday, February 9, 2022

SSD failed

I have seen HDDs fail multiple times, both in others' servers and in my own computers. They usually develop unreadable sectors, and the rest of the data can be recovered under Linux by copying it to another HDD with tools like ddrescue. But I had never seen an SSD fail myself. Until today.

Well, this case is not that different.

Today, my desktop computer failed to boot, with some error messages from systemd about failing to start services. I thought that it might be a one-off error, rebooted it, only to find out that the root partition (XFS on LUKS on /dev/sda2) failed to mount. The error in dmesg told me to run xfs_repair, which I did not do initially.

It did mount with the ro,norecovery options, but I rebooted the system afterwards instead of copying the files somewhere immediately. It was a stupid move, and a lesson for the future.
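For the record, here is a minimal sketch of what copying the files off immediately could look like. Only /dev/sda2 and the ro,norecovery options come from this post; the function name, the mapping name failing_root, the mount point and the destination directory are hypothetical placeholders, and everything has to run as root.

```shell
# Sketch: unlock the LUKS container read-only, mount XFS without log replay,
# and copy everything off before doing anything else. Run as root.
# /dev/sda2 comes from the post; all other names are hypothetical defaults.
copy_off_readonly() {
    local dev="${1:-/dev/sda2}" dest="${2:-/mnt/backup}"
    local name=failing_root mnt=/mnt/failing_root

    # --readonly keeps cryptsetup from writing through the mapping.
    cryptsetup open --readonly "$dev" "$name" || return
    mkdir -p "$mnt" || return
    # norecovery skips XFS log replay, so recently changed files may look stale.
    mount -o ro,norecovery "/dev/mapper/$name" "$mnt" || return
    # rsync keeps going past unreadable files and reports them (exit code 23).
    rsync -aHAX "$mnt"/ "$dest"/
}
```

Calling `copy_off_readonly /dev/sda2 /mnt/backup` would then leave a copy of whatever is still readable in /mnt/backup, provided that directory lives on a healthy disk.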

Then I ran xfs_repair, but it complained a lot about I/O errors, and afterwards, the filesystem was no longer mountable even with the norecovery option.

The majority of sectors are still readable, so, as of now, ddrescue + xfs_repair still looks like a valid recovery strategy. I will update the blog post if it isn't. Even if it isn't, only a day of work is lost.

Update 1: it worked, but I had to run xfs_repair twice, and the node_modules directory from one project ended up in lost+found. So apparently nothing important was lost.
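The ddrescue + xfs_repair strategy can be sketched roughly as follows. This is an illustration under assumptions, not a transcript of the exact commands: it images the partition into a file (copying to a spare disk works the same way), and the function name, image path, map file and mapping name rescued_root are all hypothetical. Run as root.

```shell
# Sketch: image the failing LUKS partition with GNU ddrescue, then repair
# the XFS filesystem on the copy instead of on the dying drive.
# /dev/sda2 comes from the post; the other names are hypothetical defaults.
rescue_xfs_on_luks() {
    local src="${1:-/dev/sda2}" img="${2:-sda2.img}"
    local map="$img.map" name=rescued_root

    # Pass 1: grab everything that reads cleanly, skipping the slow scraping phase.
    ddrescue -n "$src" "$img" "$map" || return
    # Pass 2: go back to the bad areas with direct access and up to 3 retry passes.
    ddrescue -d -r3 "$src" "$img" "$map" || return

    # Unlock the image; cryptsetup attaches a loop device for a regular file itself.
    cryptsetup open "$img" "$name" || return

    # Two runs, as in the post: the second one should find nothing left to fix.
    xfs_repair "/dev/mapper/$name"
    xfs_repair "/dev/mapper/$name"
}
```

The map file is what makes this safe to interrupt: rerunning ddrescue with the same map continues where it stopped instead of rereading the whole drive.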

And here is what smartctl in my rescue system says about the drive.


$ sudo smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.4-arch2-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Indilinx Barefoot 3 based SSDs
Device Model:     OCZ-VECTOR
Serial Number:    OCZ-Z5CB4KC20X0ZG7F8
LU WWN Device Id: 5 e83a97 27d603391
Firmware Version: 3.0
User Capacity:    512 110 190 592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb  9 15:09:11 2022 +05
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x1d) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x00)	Error logging NOT supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   0) minutes.
Extended self-test routine
recommended polling time: 	 (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Runtime_Bad_Block       0x0000   033   033   000    Old_age   Offline      -       33
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       31212
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       5821
171 Avail_OP_Block_Count    0x0000   080   080   000    Old_age   Offline      -       72682832
174 Pwr_Cycle_Ct_Unplanned  0x0000   100   100   000    Old_age   Offline      -       406
195 Total_Prog_Failures     0x0000   100   100   000    Old_age   Offline      -       0
196 Total_Erase_Failures    0x0000   100   100   000    Old_age   Offline      -       0
197 Total_Unc_Read_Failures 0x0000   100   100   000    Old_age   Offline      -       33
208 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       317
210 SATA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       60
224 In_Warranty             0x0000   100   100   000    Old_age   Offline      -       0
233 Remaining_Lifetime_Perc 0x0000   090   090   000    Old_age   Offline      -       90
241 Host_Writes_GiB         0x0000   100   100   000    Old_age   Offline      -       53191
242 Host_Reads_GiB          0x0000   100   100   000    Old_age   Offline      -       34251
249 Total_NAND_Prog_Ct_GiB  0x0000   100   100   000    Old_age   Offline      -       7808432714

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

So, despite the I/O errors and the 33 uncorrectable read failures it has counted itself (attribute 197), the drive still considers itself healthy (SMART overall-health self-assessment test result: PASSED). What a liar.
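Since the overall verdict cannot be trusted, it may be more useful to look at the raw counters directly. Below is a small sketch that reads `smartctl -A` output on stdin and prints the watched attributes whose raw value is non-zero. The function name and the default ID list (5, 195, 196, 197) are my own assumptions based on the table above; other drives use different vendor-specific IDs.

```shell
# Sketch: print watched SMART attributes whose RAW_VALUE is non-zero.
# Input: `smartctl -A` output on stdin. Optional $1: space-separated IDs.
# The default ID list is an assumption that fits this drive's table only.
smart_nonzero() {
    awk -v ids="${1:-5 195 196 197}" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) watch[a[i]] = 1 }
        # Attribute rows start with a numeric ID; the raw value is field 10.
        $1 ~ /^[0-9]+$/ && ($1 in watch) && $10 + 0 > 0 { print $1, $2, $10 }
    '
}
```

Usage: `sudo smartctl -A /dev/sda | smart_nonzero`. Fed the table above, it prints attributes 5 and 197, each with a raw value of 33.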

Update 2: after a reboot, the system does not detect the drive at all, even in the BIOS. So I was lucky to be able to copy all the data just in time.
