Did you know your drive was SMART? That stands for Self-Monitoring, Analysis, and Reporting Technology. It reports information about the status of your storage devices and can be used with other tools to help predict drive failure.
SMART (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for storage devices, usually hard drives, that provides both information about the status of a drive and the ability to run self-tests. Storage administrators can use it to check on the status of their storage devices and to force self-tests that determine the state of a device. While some people advocate using the data to predict drive failure, giving you time to get data off of the drive before an imminent failure, other references say that some SMART data may not be the best predictor of failure.
SMART and SMART – What is SMART?
Since you are reading this article, it is likely that you understand that data is important. This also means that you have backups of your most important data and you do daily backups – right? The reason we have backups is that hard drives fail for various, and sometimes mysterious, reasons. From an administrative point of view, it would be nice to be able to correlate drive failures with certain drive characteristics or with load, or even to track a batch of drives. Perhaps even better, it would be nice to be able to predict whether a drive is failing or when a drive might fail. Then you can make sure you have all of the data backed up, remove the failing drive, and replace it with a new one. To do any or all of this we need data about the storage devices.
IBM was the first company to add some monitoring and information capability to its drives, in 1992. Other vendors followed suit, and then Compaq led an effort to standardize on an approach to monitoring drive health and reporting it. This drive for standardization led to S.M.A.R.T. (this is the correct abbreviation rather than SMART but it’s not nearly as easy to type). Over time SMART capability has been added to many hard drives including PATA, SATA, the many varieties of SCSI, and SAS. The standard was based on the approach that the drives would measure the appropriate health parameters and the results would then be available to the OS or other monitoring tools. However, each drive vendor was free to decide which parameters were to be monitored and what their thresholds would be (a threshold is the point at which the drive has “failed”).
For a drive to be considered “SMART” all it needs is the ability to signal between the internal drive sensors and the host computer. There is nothing in the standard about what sensors are in the drive or how this data is exposed to the user. But at the lowest level SMART provides a simple binary bit of information – the drive is OK or the drive has failed. This bit of information is called the SMART status. Often a “drive has failed” status doesn’t indicate that the drive has actually failed but that the drive may not meet its specifications.
But it is fairly safe to assume that all modern drives have, in addition to the SMART status, SMART attributes. These attributes are completely up to the drive manufacturers and consequently are not standard. So each type of drive has to be scanned for its particular SMART attributes and possible values. In addition to SMART attributes, the drives may also support some self-tests, with the results stored in the self-test log. These logs may be scanned or read to track the state of the drive. Moreover, you can also tell the drives to run self-tests.
The difficulty in reading the SMART attributes is that each attribute has a threshold value beyond which the drive will not pass under ordinary conditions (sometimes lower values are better and sometimes larger values are better). But these threshold values are known only to the manufacturer and may not be published. In addition, each attribute returns a raw value, whose measurement is up to the drive manufacturer, and a normalized value that ranges from 1 to 253. What counts as a “normal” attribute value is completely up to the manufacturer as well. So you can see that it’s not always easy to get SMART attributes from various drives, nor is it easy to interpret the values.
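To make these pieces concrete, here is a minimal Python sketch of how a single attribute and the overall SMART status relate. The class, the field names, and the sample numbers are all made up for illustration; they do not come from any particular drive or tool. The sketch also assumes the common convention that an attribute has tripped when its normalized value falls to or below the vendor's threshold, which may not hold for every drive.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SmartAttribute:
    """One vendor-defined SMART attribute (hypothetical field names)."""
    name: str
    normalized: int  # vendor-scaled value, 1 to 253 as described in the text
    threshold: int   # vendor-chosen limit; may not be published
    raw: int         # vendor-specific raw measurement


def attribute_failed(attr: SmartAttribute) -> bool:
    # ASSUMPTION: at or below the threshold means the attribute has tripped.
    return attr.normalized <= attr.threshold


def smart_status(attrs: Iterable[SmartAttribute]) -> str:
    """Collapse all attributes into the binary SMART status."""
    return "FAILED" if any(attribute_failed(a) for a in attrs) else "OK"


# Hypothetical readings for two drives; the numbers are invented.
healthy = [
    SmartAttribute("reallocated_sectors", normalized=100, threshold=36, raw=0),
    SmartAttribute("temperature", normalized=70, threshold=45, raw=30),
]
worn = [
    SmartAttribute("reallocated_sectors", normalized=30, threshold=36, raw=1800),
    SmartAttribute("temperature", normalized=68, threshold=45, raw=32),
]

print(smart_status(healthy))
print(smart_status(worn))
```

With these invented readings the first drive reports OK and the second reports FAILED, because one of its normalized values has dropped below the threshold. Note that the raw values play no part in the decision, which is one reason they are so hard to compare across vendors.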
For most of the drives sold in the last few years there are a number of SMART attributes. The article about SMART on Wikipedia has a pretty good list of common attributes and their meaning. You will notice in the list that some attributes are better when the value is larger and some are better when the value is smaller.
Using many of the SMART attributes one would think that you could predict failure. For example, if the drive was running too hot, then it might be more susceptible to failure. Or if bad sectors were developing quickly one might think the drive was also failing. Perhaps you can use the attributes with some general models of drive failure to predict when drives might fail and then work to minimize the damage.
However, using SMART to predict drive failure has proven difficult. Google reported a study in which they examined over 100,000 drives of various types for correlations between failure and SMART values. The disks were a combination of consumer-grade drives (SATA and PATA) with speeds from 5,400 rpm to 7,200 rpm and capacities ranging from 80GB to 400GB. The population included drives from several manufacturers, with at least nine different models in total. The data in the study was collected over a nine-month window.
In the study they monitored the SMART attributes of the population of drives and also tracked which drives failed. Google chose the word “fail” to mean that the drive is not suitable for use in production even if it passes its tests (sometimes a drive would test fine but immediately fail in production). From their study the authors concluded:
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
However, despite the overall message that they had difficulty developing correlations, they did find some interesting trends.
- Google agrees with the common view that failure rates are known to be highly correlated with drive models, manufacturers, and age. However, when they normalized the SMART data by the drive model, none of the conclusions changed.
- There was quite a bit of discussion about the correlation between SMART attributes and failure rates. One of the best summaries in the paper is, “Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives. …”.
- Temperature effects are interesting in that high temperatures start affecting older drives (3-4 years or older). But lower temperatures can also increase the failure rate of drives regardless of age.
- A section of the final paragraph of the paper bears repeating here. “We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART.”
- The paper tried to combine all of the observed factors that contributed to drive failure, such as errors and temperature, but these still failed to account for about 36% of the drive failures.
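The percentages quoted above put a ceiling on what any SMART-only model could have caught. The short Python sketch below works through that arithmetic. The population of 1,000 failed drives is a made-up number chosen to keep the math easy, and the function name is just for illustration; only the 56% and 36% figures come from the paper.

```python
def max_detectable(failed_drives: int, percent_without_signal: int) -> int:
    """Upper bound on the failed drives a signal-based model could flag.

    A drive that never showed the signal cannot be flagged by a model
    that looks only at that signal, so it is excluded from the bound.
    """
    percent_with_signal = 100 - percent_without_signal
    return failed_drives * percent_with_signal // 100


FAILED_DRIVES = 1000  # hypothetical population size, not from the paper

# Over 56% of failed drives had no count in any of the four strong signals.
four_signal_bound = max_detectable(FAILED_DRIVES, 56)

# Even with every observed factor combined, about 36% were still missed.
all_factor_bound = max_detectable(FAILED_DRIVES, 36)

print(f"four strong signals: at most {four_signal_bound} of {FAILED_DRIVES}")
print(f"all observed factors: at most {all_factor_bound} of {FAILED_DRIVES}")
```

For this hypothetical population the bounds come out to 440 and 640 drives. These are best cases: they assume a model that flags every drive showing a signal, with no allowance for false alarms on drives that showed a signal but kept working.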
The paper gives some good insight into the failure rates of a large population of drives. As mentioned previously, there is some correlation between drive failure and scan errors, but that doesn’t account for all failures, a large fraction of which did not show any SMART error signals. It’s also important to repeat the comment from the paper’s final paragraph that “… SMART models are more useful in predicting trends for large aggregate populations than for individual components. …”. However, this should not deter you from watching the SMART error signals and attributes to track the history of the drives in your systems. Again, there appears to be some correlation between scan errors and drive failure, and this might be useful in your environment.
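One simple way to act on this is to record when each drive logged its first scan error and give it extra attention for a while afterward. The Python sketch below assumes you already collect a dated history of cumulative scan error counts for each drive; the data layout, the function names, and the sample dates are all hypothetical. The 60-day window simply mirrors the period quoted from the paper, and being inside it does not mean a drive will fail.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# One drive's history: (collection date, cumulative scan error count),
# sorted by date. This layout is an assumption made for the sketch.
History = List[Tuple[date, int]]


def first_error_date(history: History) -> Optional[date]:
    """Return the first date with a nonzero scan error count, if any."""
    for day, count in history:
        if count > 0:
            return day
    return None


def in_watch_window(history: History, today: date, window_days: int = 60) -> bool:
    """True if the first scan error happened within the last window_days."""
    first = first_error_date(history)
    if first is None:
        return False
    return timedelta(0) <= today - first <= timedelta(days=window_days)


# Hypothetical weekly readings for a single drive.
history = [
    (date(2024, 1, 1), 0),
    (date(2024, 1, 8), 0),
    (date(2024, 1, 15), 2),
    (date(2024, 1, 22), 2),
]

print(first_error_date(history))
print(in_watch_window(history, today=date(2024, 2, 1)))
print(in_watch_window(history, today=date(2024, 6, 1)))
```

A drive inside the window is a good candidate for a fresh backup check and closer monitoring. Keep in mind that many failed drives in the study never showed a scan error at all, so a drive outside the window is not necessarily healthy.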
Before finishing this section I wanted to give out bonus points to anyone who can tell me the origins of the title of this section. Think hard and post to the comments section.
SMART has a great deal of potential for helping administrators. While it may not be the best predictor of drive failure it can give you some indication that drives are having problems and it can definitely give you the history of the drive. Good administrators can use this information and correlate it with workload history to better map the behavior of the drive.
In the next article we’ll explore smartmontools, which allows you to examine the SMART attributes of your drives and run self-tests.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry’s enjoying the coffee and waiting for sales (but never during working hours).