Smartmontools: Ya Mon!

The last article introduced the SMART capabilities of hard drives (who knew your drives were SMART?). This article examines smartmontools, an application for reading SMART attributes and triggering self-tests.

Notice that the offline testing is enabled every 4 hours. There is some debate about whether the offline tests impact performance. From the smartctl man pages,


This type of test can, in principle, degrade the device performance. The ‘-o on’ option causes this offline testing to be carried out, automatically, on a regular scheduled basis. Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.

So it’s reasonably safe to turn on this option, but it’s up to you to decide, based on your workload, how it will impact your performance. For my desktop I don’t worry about the performance impact, but for a server, particularly a storage server where the storage could be under a constant load, I might think twice about turning this on. Instead I might rely on a cron job to run offline tests (or perhaps only run them during a maintenance period). Regardless, take the time to test your situation and decide how often you will run an offline test (but you should run one periodically).
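If you go the cron route, the job can be a small script that only starts the test during a quiet period. The following is a minimal sketch, not a tested production script: the 02:00 to 04:00 window and the function names are my own assumptions, the smartctl path is simply the one used on the test machine in this article, and “-t offline” is the smartctl option that starts an immediate offline test.

```python
import datetime
import subprocess

# Path used on the test machine in this article; adjust for your system.
SMARTCTL = "/usr/local/sbin/smartctl"


def in_maintenance_window(now, start_hour=2, end_hour=4):
    """Return True when `now` falls inside the (hypothetical) quiet period."""
    return start_hour <= now.hour < end_hour


def offline_test_command(device):
    """Build the smartctl command line that starts an immediate offline test."""
    return [SMARTCTL, "-t", "offline", device]


def run_offline_test_if_quiet(device, now=None):
    """Start an offline test only inside the maintenance window (needs root)."""
    now = now or datetime.datetime.now()
    if not in_maintenance_window(now):
        return False
    subprocess.run(offline_test_command(device), check=True)
    return True
```

A cron entry would then call run_offline_test_if_quiet("/dev/sdb") as root; outside the window the function returns False without touching the drive.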

The next smartmontools option to try is the “-c” option. This option prints out the generic SMART “capabilities” of the drive. In this case, capabilities refers to the ability to run tests and store the results in a log. An example of the output from smartctl using the -c option is shown below for /dev/sdb.

[root@test64 laytonjb]# /usr/local/sbin/smartctl -c /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 119) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


There are some interesting bits in this output. You can read through the various options, but one interesting part is that the drive is capable of running self-tests: the “short” self-test has a recommended polling time of 1 minute and the “extended” self-test (run here in off-line mode) has one of 119 minutes.
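If you want those polling times in a script (for example, to decide how long to wait before checking on a test), they can be pulled out of the “-c” output with a regular expression. This is only a sketch: the function name is made up for illustration, and the sample text is copied from the output above rather than read from a live drive.

```python
import re

# A few lines copied from the "smartctl -c /dev/sdb" output in this article.
SAMPLE = """\
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 119) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
"""


def polling_times(capabilities_text):
    """Map each self-test name to its recommended polling time in minutes."""
    pattern = re.compile(
        r"^(\w+) self-test routine\s*\n"
        r"recommended polling time:\s*\(\s*(\d+)\) minutes\.",
        re.MULTILINE,
    )
    return {name.lower(): int(minutes)
            for name, minutes in pattern.findall(capabilities_text)}


print(polling_times(SAMPLE))
# {'short': 1, 'extended': 119, 'conveyance': 2}
```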

Since this is the first time the drives have been examined using smartmontools, both the short and extended self tests should be run. The output below is for the short self-test.

[root@test64 laytonjb]# /usr/local/sbin/smartctl -t short /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sat Apr 10 22:15:37 2010

Use smartctl -X to abort test.


This command starts the self-test and returns immediately. To check whether the test has finished, and to see its results, use the “-l selftest” option with smartctl.

[root@test64 laytonjb]# /usr/local/sbin/smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1432         -
# 2  Short offline       Completed without error       00%      1432         -
# 3  Short offline       Completed without error       00%      1432         -
# 4  Short offline       Completed without error       00%      1432         -


You can see that I ran the test 4 times (just to be sure), and all four tests completed without error.
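Reading the log by eye is fine for one drive, but the same check is easy to script. Here is a rough sketch that parses the rows of the self-test log and reports whether every entry completed without error. The function names and the regular expression are my own, written against the log layout shown above; other drives or smartctl versions may format the log differently.

```python
import re

# The self-test log rows shown above, copied from this article.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1432         -
# 2  Short offline       Completed without error       00%      1432         -
# 3  Short offline       Completed without error       00%      1432         -
# 4  Short offline       Completed without error       00%      1432         -
"""

# Columns are separated by two or more spaces; words inside a column by one.
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per log row, in the order smartctl printed them."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            num, test, status, remaining, hours, lba = match.groups()
            entries.append({
                "num": int(num),
                "test": test,
                "status": status,
                "remaining_percent": int(remaining),
                "lifetime_hours": int(hours),
                "lba_of_first_error": lba,
            })
    return entries


def all_passed(entries):
    """True only if there is at least one entry and none reported an error."""
    return bool(entries) and all(
        entry["status"] == "Completed without error" for entry in entries
    )


entries = parse_selftest_log(SAMPLE_LOG)
print(len(entries), all_passed(entries))
# 4 True
```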

We can also invoke the extended (offline) testing in a similar way.

[root@test64 laytonjb]# /usr/local/sbin/smartctl -t long /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 119 minutes for test to complete.
Test will complete after Sat Apr 10 22:24:41 2010

Use smartctl -X to abort test.


Just as with the short self-test, you can tell when it’s done by listing the log using the smartctl option “-l selftest”.

[root@test64 laytonjb]# /usr/local/sbin/smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1433         -
# 2  Short offline       Completed without error       00%      1432         -
# 3  Short offline       Completed without error       00%      1432         -
# 4  Short offline       Completed without error       00%      1432         -
# 5  Short offline       Completed without error       00%      1432         -


The extended test took a while to finish but as you can see it completed without error.

We can also search the SMART logs for “errors” with a simple command:

[root@test64 laytonjb]# /usr/local/sbin/smartctl -l error -d sat /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged


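The option “-d sat” hasn’t been used before. The “-d” option sets the device type, and “sat” stands for SCSI to ATA Translation, not “SATA”: it tells smartctl that this is an ATA drive sitting behind a SCSI translation layer, which is how Linux presents SATA drives such as /dev/sdb. Specifying it just saves smartctl from having to determine the type of drive.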

Now that it looks like the drive is good (no errors and SMART is enabled), we can start to probe the drive a little further. Earlier we used the “-c” option to list the test and reporting capabilities of the drive. We can also use the “-a” option, which prints all of the SMART information for the drive, including the vendor specific SMART attributes:

[root@test64 SMARTMONTOOLS]# /usr/local/sbin/smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST3500320AS
Serial Number:    9QM5WJ21
Firmware Version: SD15
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Apr 11 17:32:31 2010 EDT

==> WARNING: There are known problems with these drives,
AND THIS FIRMWARE VERSION IS AFFECTED,
see the following Seagate web pages:

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 119) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   100   006    Pre-fail  Always       -       86741246
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       82
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       170269847
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1435
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       83
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   063   045    Old_age   Always       -       26 (Lifetime Min/Max 18/26)
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   023   023   000    Old_age   Always       -       86741246
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1435         -
# 2  Extended offline    Completed without error       00%      1433         -
# 3  Short offline       Completed without error       00%      1432         -
# 4  Short offline       Completed without error       00%      1432         -
# 5  Short offline       Completed without error       00%      1432         -
# 6  Short offline       Completed without error       00%      1432         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


This listing is rather long but has a great deal of information buried in it. We’ve seen the first part of the output before using the “-c” option: it’s the capabilities section. But we have not yet seen the second part. The table in the second part has labels starting with “ID# ATTRIBUTE_NAME FLAG … ” and contains the vendor specific SMART attributes. The first attribute to examine is the first row, “Raw_Read_Error_Rate.”

The Raw Read Error Rate attribute is the rate of hardware read errors that occurred when reading data from the drive. The value of the attribute is 115, its worst value is 100, and the threshold is 006. Does this mean the read error rate is 115 when the threshold is 6? Not necessarily, because these are normalized values defined by the vendor, and the absolute numbers are meaningless without knowing their definitions. What you should do is track that attribute and see when/if it changes.
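One thing that can be said about these numbers comes from the smartctl man page: an attribute is considered failed when its normalized value is less than or equal to its threshold. A short sketch of that comparison follows. The parsing is my own and assumes the ten-column table layout shown above; the three sample rows are copied from the listing.

```python
# Header plus three rows copied from the "smartctl -a /dev/sdb" listing above.
SAMPLE_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   100   006    Pre-fail  Always       -       86741246
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   063   045    Old_age   Always       -       26 (Lifetime Min/Max 18/26)
"""


def parse_attributes(text):
    """Return {attribute name: fields} for every row of the attribute table."""
    attributes = {}
    for line in text.splitlines():
        # Nine splits keep the free-form RAW_VALUE column in one piece.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attributes[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "type": fields[6],
                "raw": fields[9],
            }
    return attributes


def failed_attributes(attributes):
    """Names of attributes whose normalized value is at or below the threshold."""
    return [name for name, attr in attributes.items()
            if attr["value"] <= attr["thresh"]]


attrs = parse_attributes(SAMPLE_TABLE)
for name, attr in attrs.items():
    print(name, attr["value"], attr["thresh"], attr["value"] - attr["thresh"])
print(failed_attributes(attrs))
```

For this drive nothing is at or below its threshold, so the final list is empty. The margin printed for each row (value minus threshold) is only a convenience for watching trends, not something defined by SMART itself.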

There are other attributes that are useful to monitor as well. Here is a sample of the attributes reported by the drive in this article.


  • Reallocated_Sector_Ct: This is the number of reallocated sectors on the drive. Basically this means that there has been a verification error on a specific sector on the drive and that sector is remapped to an area that has spare sectors. Typically the “raw” value is the number of sectors that have been remapped.
  • Seek Error Rate: This is the rate of seek errors of the drive heads.
  • End-to-End Error: This is the number of errors when the data transferred through the drive cache does not match the data at the host. Typically this is measured by a parity calculation.
  • Command Timeout: The number of aborted drive operations due to a drive timeout.

There are other attributes as well. Typically Google will turn up a discussion about them (don’t forget that they vary from manufacturer to manufacturer and drive to drive).
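Since the advice is to track attributes and watch for changes, here is one way that could look in a script: keep the raw values from the previous run and compare them with the current ones. This is a sketch under assumptions. The list of watched attributes is just my pick from the table above, the “previous” values are the raw values from the listing, and the “current” values are invented purely to show a change.

```python
# Attributes picked from the table above for illustration.
WATCHED = ("Reallocated_Sector_Ct", "End-to-End_Error", "Command_Timeout")


def raw_value_changes(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched attributes whose raw value changed."""
    changes = {}
    for name in watched:
        if name in previous and name in current and previous[name] != current[name]:
            changes[name] = (previous[name], current[name])
    return changes


# Raw values taken from the listing in this article.
previous = {"Reallocated_Sector_Ct": 0, "End-to-End_Error": 0, "Command_Timeout": 0}
# Hypothetical later snapshot, invented for this example.
current = {"Reallocated_Sector_Ct": 8, "End-to-End_Error": 0, "Command_Timeout": 0}

print(raw_value_changes(previous, current))
# {'Reallocated_Sector_Ct': (0, 8)}
```

In practice the snapshots would come from parsing “smartctl -a” output on each run and storing the result somewhere between runs.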

Summary

This article is just a quick introduction to smartmontools, which allows Linux users to work with the SMART attributes and capabilities of storage devices. The tool is easy to configure and works quite well for most common drives. However, remember that the SMART attributes are not standardized, so smartmontools may not know about your particular drive (or RAID card). It may take some work to get it to understand the attributes of your particular drive (don’t hesitate to use the smartmontools mailing list), but once the drive is included in the smartmontools database, querying SMART attributes and capabilities is a bit easier.

SMART can be a great asset for administrators and even home users. It has a great deal of capability and can be used to watch the history of your storage devices. The capability is quite broad and we haven’t even gotten into the smartmontools daemon, smartd. That’s the subject for future articles. In the meantime, take a look at the smartmontools webpage and look at the man pages. Take some time to read through the documentation and then start checking your own storage devices.
