Downtime is a big part of systems administration, but downtime doesn’t have to be painful. A well-planned and well-executed downtime is part of a job well done.
System administrators have no concept of “normal hours.” We are constantly on call, so there is no day or night, only “production time” (the time when the system needs to be up) or “downtime.”
Infrequently—as a result of planning or statistical anomalies—we system administrators seem to enjoy normal working hours, but it’s simply an illusion. In reality, each of us labors after dark and on weekends, unseen and often in a loud, arctic data center. A lot of our character and skill is built during such impromptu downtimes, beating application and systems software into submission. Such marathon sessions hone our diagnostic skills and tune our intrinsic sense to the rhythms of the environments in which we work. In addition, downtime makes us appreciate when things are quiet.
However, even if crises are common, there’s no need to provoke misfortune. Scheduled downtime heads off much of the trouble, although results can vary. It tends to come in three varieties:
- The “Quickie” is a scheduled downtime that requires much less time than anticipated. Whether from an overabundance of planning, a lack of resistance, or sheer luck, all maintenance goes smoothly. The upside? We’re done early. The downside? Oodles of unexpired downtime, requiring a lengthy explanation, lest management expect every task to go so effortlessly.
- The “Nightmare” is downtime doomed from the start, usually the result of poor planning, lack of information, or both. Nothing goes right.
- Of course, there is “The Job That Goes as Planned,” which is self-explanatory.
Ideally, all downtime goes as planned, but there are no guarantees. However, there are some techniques to improve the likelihood of that outcome. The next five sections list best practices.
Whether it’s an existing ticketing system or a uniform, structured email message, a standardized change management notice tells your users what to expect, when to expect the interruption, and when to expect resumption. Specify what’s being planned and who is impacted by the change to give the impacted parties a chance to get involved. This isn’t meant to complicate your downtime, but rather to solicit agreement from responsible parties.
For example, if your change is an application upgrade or migration, you’ll want your application and business analysts to fill you in on any requirements they or the end users have before the downtime. You don’t want to be surprised during your change window.
A good change management system even helps when downtime is unplanned. Keeping a record of how you reacted to problems helps you document the solution (if not simply remember what happened). This is important when you want to train that new pimply-faced junior sysadmin wannabe without having to walk him through every little detail of what you do.
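As a rough illustration of the standardized notice described above, here is a minimal Python sketch. The `ChangeNotice` class, its field names, and the message layout are all hypothetical; your ticketing system or email template will have its own format. The point is only that every notice answers the same questions: what, when, until when, and who is impacted.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeNotice:
    """A hypothetical, minimal change notice; field names are illustrative."""
    summary: str       # what is being changed
    start: datetime    # when the interruption begins
    end: datetime      # when service is expected to resume
    impacted: list[str] = field(default_factory=list)  # who is affected

    def render(self) -> str:
        """Return the notice as a plain-text message body."""
        if self.end <= self.start:
            raise ValueError("end must be after start")
        fmt = "%Y-%m-%d %H:%M"
        who = ", ".join(self.impacted) or "none identified"
        return "\n".join([
            f"CHANGE: {self.summary}",
            f"BEGINS: {self.start.strftime(fmt)}",
            f"ENDS:   {self.end.strftime(fmt)}",
            f"IMPACTED: {who}",
        ])
```

Because every notice is rendered from the same fields, users learn where to look for the window and for their own name on the impacted list.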
Have a Plan
Write out specifically what you intend to do. Let me say that again. Write out specifically what you intend to do, command-by-command, if necessary.
This sounds like an obvious thing to do, but too many sysadmins walk into a downtime with a cloudy notion of specific steps to take. Furthermore, a detailed list helps you more than you know in the wee hours of the morning (when you are bleary-eyed and mentally fatigued from rebuilding LUNs after that online firmware upgrade to the SAN failed, unmounting every disk in the environment).
In addition to the work plan, have a contingency plan to restore the environment to its original state. Know ahead of time when to stop or reverse course if things aren’t going according to plan.
You need not include your work plan or contingency plan in the change notice. Offering that kind of detail is probably counterproductive. The change notice is a courtesy, not a call for technical review.
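One way to picture a work plan paired with its contingency plan is as a list of steps, each carrying its own undo. The Python sketch below is an assumption for illustration only: the `Step` and `run_plan` names are hypothetical, and a written checklist serves the same purpose. It shows the idea of deciding ahead of time how to reverse course: when a step fails, the completed steps are backed out in reverse order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One entry in a work plan: what to do and how to undo it."""
    description: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_plan(steps: list[Step]) -> bool:
    """Run steps in order; on the first failure, undo completed steps in reverse.

    Returns True if every step succeeded, False if the plan was backed out.
    """
    done: list[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            print(f"FAILED: {step.description} ({exc}); backing out")
            for finished in reversed(done):
                finished.undo()
            return False
        done.append(step)
    return True
```

Writing the undo next to each step forces the question "how do I get back?" before the window opens rather than at 3 a.m.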
Prepare, Prepare, Prepare
In system administration, it’s preparation, preparation, preparation. Make as many non-impactful changes as possible before the scheduled downtime.
For example, make updates to configuration files beforehand and leave them commented out. When the downtime begins, uncomment the changes and restart the process in question. This saves oodles of time, allowing ample opportunity to back out and turn around.
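The staged-change trick above can be sketched in a few lines of Python. This assumes a convention that is not part of any particular tool: pending lines are prefixed with a made-up `#STAGED ` marker so they can be told apart from ordinary comments. The `activate_staged` function is hypothetical and only transforms text; writing the file back and restarting the process are left to you.

```python
def activate_staged(text: str, marker: str = "#STAGED ") -> str:
    """Return config text with the marker stripped from staged lines.

    Ordinary comments and active lines pass through unchanged.
    """
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(marker):
            # Preserve the original indentation of the staged line.
            indent = line[: len(line) - len(stripped)]
            line = indent + stripped[len(marker):]
        out.append(line)
    return "".join(out)
```

Keep a copy of the original file before activating anything, so backing out is a single copy rather than an editing session.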
You may be tempted to automate as much of your maintenance as possible. This is generally a good idea if the maintenance involves a lot of repetitive tasks. However, it is a poor fit when the change is a series of distinct, one-off tasks. You don’t want the change to run away from you if something unexpected pops up. In other words, automate in moderation.
System updates and patches are probably the most common form of systems maintenance performed by a wide variety of sysadmins of different skill levels. For the love of your chosen profession, verify that the patches are in a pre-mounted and accessible share. You don’t want to be troubleshooting NFS during your downtime.
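A quick preflight check before the window opens can catch a missing or empty patch share. The sketch below is a minimal example; the `share_ready` function name and the specific checks are assumptions, and it only confirms that the directory exists, is readable, and contains something. It does not validate the patches themselves.

```python
import os


def share_ready(path: str) -> list[str]:
    """Return a list of problems with the patch directory; empty means ready."""
    problems = []
    if not os.path.isdir(path):
        # An unmounted share usually shows up as a missing or empty directory.
        problems.append(f"{path} is not a directory")
    elif not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable")
    elif not os.listdir(path):
        problems.append(f"{path} is empty")
    return problems
```

Run it from the host you will be patching, since a share that is mounted on your workstation proves nothing about the server.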
Simply put, minimize distractions and work quickly. It’s perfectly acceptable to tell Flo the office lady that you’re too busy to talk—just because you’re at the office after hours doesn’t mean it’s your spare time. Close your door and lock it.
Working quickly does more than get your changes done. It leaves the balance of the window for verifying the correctness and completeness of the changes.
The final step of the strategy is follow-up. This is purely an administrative step in the process. Close the file in your change management system and write a follow-up email that includes a narrative of the change, any problems encountered, and each respective solution.
The narrative is a powerful customer service tool that provides your users with background and analysis. Narratives also give your customers an inside track. You shouldn’t expect feedback from a narrative, but do consider the opinions and observations of well-intentioned customers who do elect to respond. Building a rapport is invaluable.
Do the Downtime
Downtime is a big part of what we do as systems administrators, and downtime doesn’t have to be painful. A well-thought-out, well-planned, and well-executed downtime is part of a job well done.
This same mindful strategy can be applied retroactively to emergency downtimes. Using the steps in this strategy helps you document what solutions you employ to solve problems on the fly and helps keep your users informed. Informed users are (usually) happy users.