NVMe Burnout
Scenario
The machine hangs. To be fair, the applications kept running, notably Ceph, but something was spinning and you couldn't log in to see what, even on the console.
My own fault as the (root) disk isn't mirrored but then, arguably, it couldn't be as there isn't enough room!
I then waste half a day trying to figure out if the (gah!) desktop BIOS's Intel RST technology is getting in the way. The BIOS does get in the way, for other reasons, but in between the devices coming and going depending on reboot, power cycle, BIOS settings etc., the NVMe drive was being reported as either 1GB or 128GB. Hmm, interesting and, as it turns out, symptomatic (the appearing as a 1GB drive, not the 128GB bit).
Linux, for that matter, was simply reporting an inability to read from or write to the device.
Finally I strip it down and plug the NVMe into another box with two M.2 slots and discover the nvme-cli toolset.
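(If your distribution doesn't have it to hand, the toolset is generally packaged as nvme-cli; one of these should do the trick, depending on flavour:)

# apt install nvme-cli     (Debian/Ubuntu)
# dnf install nvme-cli     (Fedora et al.)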
Diagnosis
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     2J4420076833         ADATA SX8200PNP                          1           3.98 GB /   1.02 TB      512 B + 0 B      SS0411B
/dev/nvme1n1     S1XVNYAGA00068       SAMSUNG MZVPV128HDGM-00000               1         128.04 GB / 128.04 GB      512 B + 0 B      ERRORMOD
Scroll to the right and check out the Firmware Revision for the second drive: ERRORMOD. Ooh, err. Doesn't sound promising. It is actually a shortened ERRORMODE.
There are suggestions that you can use nvme reset /dev/nvme1, subsystem-reset and others, but actually the one to watch is smart-log, the equivalent of smartctl.
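For completeness, the reset variants are invoked along these lines (assuming /dev/nvme1 is the controller in question), though it was the smart-log output below that told the story:

# nvme reset /dev/nvme1
# nvme subsystem-reset /dev/nvme1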
# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 36 C
available_spare                     : 0%
available_spare_threshold           : 10%
percentage_used                     : 255%
data_units_read                     : 24,617,317
data_units_written                  : 45,824,484
host_read_commands                  : 248,276,781
host_write_commands                 : 1,340,431,412
controller_busy_time                : 38,672
power_cycles                        : 204
power_on_hours                      : 35,670
unsafe_shutdowns                    : 115
media_errors                        : 1
num_err_log_entries                 : 1
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Luckily there's an NVMe specification document to help explain some of this but in practice I've hammered my NVMe drive to death. In particular, find Figure 79 on page 91 for an explanation of most of these numbers.
Let's take a look at some of them.
- power_on_hours
This doesn't need a 212-page document to explain itself, but 35,670 does sound like a lot of hours. It is just over 4 years, which handily places it in January 2016, when I bought it.
- available_spare
This is the main clue about the deadness of our disk. The official line is:
Contains a normalized percentage (0 to 100%) of the remaining spare capacity available.
So, uh, nothing left, then.
"Capacity" seems to be a bit disingenuous as this isn't a measure of blocks in use versus blocks in the drive but rather ability to continue to do work. Perhaps "spare" denotes something like the old HDD trick of having extra sectors that could be mapped in when a regular sector became unreadable. At least I think so as I can't access this drive to see what's in use. I can look at its sibling drive in the machine next door, though.
It is reporting 99% for available_spare. The actual disk usage is hard to figure (a large chunk was being used for Ceph caching and, so far as the controller is concerned, still is) but of the file systems to hand I'm using 4GB of the 256GB drive, which is 1.5% on its own.
- percentage_used
255% Ouch!
Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer's prediction of NVM life.
Actually, 255% is a lie too, as 255 is the largest number the field can hold. So, probably somewhere north of that in practice.
The sibling drive is running at 75%. It is twice the size so maybe you could argue that it is the equivalent of 150% of the life of a drive half the size.
- data_units_read / data_units_written
Here a unit is 1,000 512-byte blocks, so 0.5MB. If my maths is any good (there's a quick conversion sketch after this list), that looks like we've read 12TB and written 23TB.
The sibling drive has done more work, 25TB and 60TB respectively.
- host_read_commands / host_write_commands
No deep introspection required, though it is interesting that there is roughly a 1:5 ratio between reads and writes.
The sibling drive is about 1G reads to 3G writes: noticeably more in absolute terms and a rather different ratio.
- power_cycles / unsafe_shutdowns
Have I power-cycled it 200 times? Maybe. I guess the unsafe shutdowns reflect me fighting with the BIOS rather than necessarily pulling the power in normal flight. The NVMe drive will be on and subject to an unsafe shutdown in the BIOS just as much as anything else.
I have pulled the power a few times, though.
The sibling system is running at half those numbers.
- media_errors
I'm slightly surprised there's only been one media error if it has consumed all of its "spare capacity".
- num_err_log_entries
I went looking for this straight away but unfortunately the error-log sub-command only reports success. Go figure!
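To put numbers on the conversions above, a quick back-of-the-envelope sketch using the smart-log figures (and remembering a data unit is 1,000 512-byte blocks):

# awk 'BEGIN {
    du = 512000                            # one data unit = 1,000 512-byte blocks
    printf "read     %.1f TB\n", 24617317 * du / 1e12
    printf "written  %.1f TB\n", 45824484 * du / 1e12
    printf "power-on %.1f years\n", 35670 / 24 / 365
  }'
read     12.6 TB
written  23.5 TB
power-on 4.1 years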
Summary
Recalling my enterprise corporate hat-wearing days: when buying some SSDs for servers I was pressed, rather heavily, on whether they would be read-intensive or write-intensive. After feeling aggrieved that I didn't have the answer to such a question ready to hand, I deemed that even if my colleagues compiled stuff all day, every day, they would hardly trouble the 1 Drive Write Per Day (DWPD) rating of the read-intensive SSDs.
In my case the drive (decidedly non-enterprise) has written 23TB in its 4 years of life (technically 35670 / 24 == 1486 days). That's approximately 15GB per day or about 0.12 DWPD.
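The same sort of one-liner covers the DWPD sum, assuming the 128GB capacity and the 23TB / 1,486 days from above:

# awk 'BEGIN { daily = 23e12 / 1486; printf "%.1f GB/day, %.2f DWPD\n", daily / 1e9, daily / 128e9 }'
15.5 GB/day, 0.12 DWPD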
Is that any good? Is it what we might expect? Searching for the answer is quite hard. For a start the manufacturers like to quote TBW (Tera Bytes Written -- not Total Bytes!) for the lifetime of the drive. Even that isn't easy to find. I'm seeing numbers in the low 3 figures for the SM951 (although where those figures ultimately come from I don't know as I can't find a data sheet) but those were for larger drives. So maybe this is ballpark for a small NVMe drive.
Ultimately, though, it's lasted four years and in the last year it has been hit by Ceph chuntering away all day and night. I should have gotten these stats before I started!
August 2020 Update
Boom, the sibling drive dies. It was coincident with a power outage so I didn't realise at first. Of interest is that the BIOS absolutely refused to see the presence of the NVMe drive. However, when you booted Linux off a USB stick it would see the NVMe drive -- though it got IO errors. As a side note, Linux then issued messages about a USB cable -- no USB cable involved, the USB stick is directly in the back panel. Out of curiosity, I plugged the stick into a USB adaptor and Linux was perfectly happy to boot. Go figure.
Similar kinds of numbers:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S1XWNYAGB32830       SAMSUNG MZVPV256HDGL-00000               1         256.06 GB / 256.06 GB      512 B + 0 B      ERRORMOD

# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 52 C
available_spare                     : 0%
available_spare_threshold           : 10%
percentage_used                     : 255%
data_units_read                     : 49,330,158
data_units_written                  : 121,778,826
host_read_commands                  : 945,941,088
host_write_commands                 : 2,993,398,144
controller_busy_time                : 71,651
power_cycles                        : 121
power_on_hours                      : 39,629
unsafe_shutdowns                    : 66
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
power_on_hours suggests February 2016 again. 255% percentage_used again! Data units read and written are 24TB read and 61TB written.
I suspect the damage was done when it was acting as the journal for Ceph, but it struggled on for a while.
(I really must start logging this information!)
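A minimal way to do that, assuming nvme-cli is installed and the drive is /dev/nvme0, might be a daily cron script that appends a timestamped snapshot somewhere:

#!/bin/sh
# /etc/cron.daily/nvme-smart-log -- a rough sketch; the path and device are assumptions
{
  date --iso-8601=seconds
  nvme smart-log /dev/nvme0
} >> /var/log/nvme-smart.log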