How to Maintain 1,000 Linux Servers

Linux promises extreme flexibility and cost savings. But many enterprises are struggling to accommodate a growing number — hundreds, even thousands — of Linux servers. Learn how to take a new approach to administration that will simplify and streamline Linux management and configuration to keep large-scale deployment economical.

Over the past few years, Linux has changed the economics of the data center. In some data centers, where proprietary servers and operating systems have been replaced by cheap commodity hardware and Linux, operational costs have been reduced by as much as 50 percent. Lured by incredible savings, many organizations are on the verge of incorporating hundreds or thousands of servers into the data center, making Linux the fastest-growing server operating system, easily surpassing the rate of Windows Server deployments.

But success has come at a price: as inexpensive servers multiply in number, growing complexity, administrative burden, and overhead costs undermine the savings and virtues of commodity computing. Moreover, existing system management tools and methodologies originally designed to manage a small number of (largely homogeneous) large-scale servers are no longer adequate.

It’s time to think differently about managing servers. It’s time to think outside of the big black box.

Modern data centers face an unlikely paradox: You can replace monolithic, expensive, and proprietary servers with inexpensive commodity Linux servers, but you won’t save any money. Instead of pocketing the savings, you’ll spend it on additional staff, additional maintenance, and additional infrastructure.

For example, Linux servers are far more heterogeneous than the “big iron” they typically replace. Commodity computing means different motherboards, varied processors, and numerous flavors of Linux. Hence, as the number of Linux servers grows, the complexity of managing such diversity increases exponentially.

And as any IT manager can tell you, as complexity grows, so do costs. According to an article published in November 2004 in Computer Weekly, “Gartner has warned that the total cost of [Linux] ownership is not likely to remain an endless treasure as system complexity increases.”

But don’t toss the Penguin out in the cold just yet. It may sound radical, but to maintain 1,000 Linux servers — not an unrealistic number of machines — you don’t need more people. You need to “think different.”

1. Think Differently About Administration

To grow your Linux server farm without growing your staff, it’s necessary to move away from the procedural approaches that have grown up over the last few decades. While the procedural approach is familiar and appropriate for small and ad-hoc situations, it’s not well-suited for large enterprise Linux deployments.

Procedural approaches, whether manual or automated, proliferate with each new server, each new version of software, and each new variation in function, becoming ever more complex as the number of servers runs into the hundreds and thousands. When hundreds or thousands of diverse servers must be populated with a software stack, scripting becomes impractical, as does the very large army of administrators otherwise required to reconfigure each server individually.

To run a practical and cost-effective data center, simplifying administration is key. Automating tasks or increasing procedural speed is no longer as important as the outcome itself: the state.
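To make the contrast concrete, here is a minimal sketch of the two mindsets in Python; the commands, versions, and checksum placeholder are purely illustrative and not tied to any particular distribution or management tool.

```python
# Illustrative only: the commands, versions, and checksum placeholder are not
# tied to any particular distribution or management tool.

# Procedural: a recipe of steps, re-run (and drifting) on every server.
procedural_steps = [
    "yum install -y httpd",
    "cp golden-httpd.conf /etc/httpd/conf/httpd.conf",
    "service httpd restart",
]

# State-oriented: a declaration of the outcome; the steps required to reach it
# are derived per server, whatever state that server happens to be in.
desired_state = {
    "packages": {"httpd": "2.0"},
    "files":    {"/etc/httpd/conf/httpd.conf": "checksum-of-approved-config"},
    "services": {"httpd": "running"},
}
```

A thousand servers can be reconciled against the same declaration; a thousand copies of the recipe must each be written, run, and maintained.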

2. Manage State

In a new approach to administration, one can specify a desired end state and identify the transactions required to achieve that state. Putting servers under this type of control allows IT management to think of administration as an operational machine with a set of state transitions. By reframing the entire data center as a problem of state manipulation, administration tasks become discrete transitions from one state to another.

So, instead of running a series of command scripts, the system simply enacts a state transformation. Hundreds of dissimilar servers at a time may participate in such transformations, so even the largest data centers can be efficiently managed and made highly scalable.

To manage state, it is necessary to identify common, exact intrinsic operations, such as adding software, removing software, updating software, bringing in new hardware, retiring old hardware, handling hardware failures, and so on. Once you’ve classified these intrinsic operations, you can authoritatively manage state.
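As a minimal sketch of this idea (the package inventories, version numbers, and the plan_transition() helper below are hypothetical, not any vendor's API), a state transition can be computed as the set of intrinsic operations that carry a server from its observed state to its desired state:

```python
# Illustrative sketch: hypothetical package inventories and a hypothetical
# plan_transition() helper, not any particular tool's API.

OBSERVED = {"openssl": "0.9.7", "httpd": "2.0", "telnet-server": "0.17"}
DESIRED  = {"openssl": "0.9.8", "httpd": "2.0", "postgresql": "8.0"}

def plan_transition(observed, desired):
    """Decompose a state change into intrinsic operations: add, update, remove."""
    ops = []
    for item, version in desired.items():
        if item not in observed:
            ops.append(("add", item, version))
        elif observed[item] != version:
            ops.append(("update", item, observed[item], version))
    for item in observed:
        if item not in desired:
            ops.append(("remove", item))
    return ops

print(plan_transition(OBSERVED, DESIRED))
# expected: [('update', 'openssl', '0.9.7', '0.9.8'),
#            ('add', 'postgresql', '8.0'), ('remove', 'telnet-server')]
```

The same plan can be computed for, and applied to, every server that should converge on that state, which is what makes the approach indifferent to whether there are ten machines or a thousand.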

For example, consider the telecommunications industry. Over 100 years old, the industry has had plenty of time to identify its core intrinsic operations, known as Move, Add, Change, Delete (MACD). Communications service providers can track these four simple intrinsics to define every single operation in a complex network, improving order management and data center efficiency.

There are probably fewer than ten intrinsics of state manipulation within the modern data center, with hardware, middleware/operating system, and applications/services among them. In a state-managed data center, you can articulate all of these states, and relate and extract the information they carry. In addition, you can detect any issues and ultimately “roll back” state. Couple state articulation with speed, and whole new possibilities open up in the data center for change control, dynamic repurposing, on-demand computing, resource utilization, and scalability.

3. Achieve Positive Scalability

A data center is scalable if it can expand the critical services it provides to its customers. After all, service justifies the existence of the data center and all of the infrastructure required to operate it. The amount of infrastructure itself — how many people, how many machines — is irrelevant.

For the data center, positive scalability means that the more servers under management, the better. Additional machines and additional infrastructure in turn scale the business or service provided, and therefore affect revenues.

A state-based approach to management creates positive scalability. Positive scalability breaks the old procedural cycle of rising cost and complexity, even as IT management struggles to bring hardware and software costs down.

4. Virtualize Server Hardware and Software

To separate the state of a server from the physical platform on which it exists, one must not only create abstractions for the hardware (platform virtualization) and the software (storage virtualization), but also be able to rapidly construct and deconstruct servers by manipulating the logical-to-physical mappings of their populated subcomponents.

What’s needed in the modern data center is the ability to quickly repurpose real platforms with the same flexibility and rapidity as one would use to create, hibernate, and reactivate virtual machines, but without actually paying for the overhead of virtual machines. This allows for static pools of fungible platforms that can quickly and cheaply be used for different and diverse services on demand.

For example, as web traffic peaks at a certain time of the day, you can fire up additional web servers to maintain satisfactory average response time. Machines that are not being used are repurposed, and later returned to their original tasks. Of course, if reallocation cannot be done quickly — even dynamically in response to traffic — it’s of no use.
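A rough sketch of such repurposing, with hypothetical host names and roles and a repurpose() helper that stands in for whatever remapping mechanism is actually used, might look like this:

```python
# Illustrative only: host names, roles, and the repurpose() helper are
# hypothetical stand-ins for whatever remapping mechanism is actually used.

pool = {"node01": "idle", "node02": "batch", "node03": "idle", "node04": "web"}

def repurpose(pool, role, needed):
    """Assign idle machines to `role` until `needed` machines are serving it."""
    serving = [host for host, r in pool.items() if r == role]
    for host in [h for h, r in pool.items() if r == "idle"]:
        if len(serving) >= needed:
            break
        pool[host] = role        # remap the logical role; the machine's state
        serving.append(host)     # follows the role, not the physical box
    return pool

# Web traffic peaks: grow the web tier to three machines. Later, the same
# remapping in reverse returns the borrowed hosts to the idle pool.
print(repurpose(pool, "web", 3))
```

The value lies in the logical-to-physical mapping being cheap to change, not in any particular scheduling policy.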

5. Leverage Shared Storage

The modern data center can be viewed as a system and architected to simplify state management of its individual building blocks: hardware, software, applications, and so on. In this view, it’s important to detach the relationship between state and each platform, so hardware and software can be easily managed and swapped out, while the service is kept up and running.

Shared storage is the key to putting state control (logically) in one place. An image-based approach achieves only a 1-to-1 correlation; a repository for shared storage of common data achieves a 1-to-many correlation.

Look at all the different software in your system, and determine what is common so that you can keep one copy of shared data and limit complexity. Reduce the number of discrete states by sharing as much state as possible; components such as applications or operating systems are often common across multiple machines. Keep hardware independent from state to limit risk. That way, if one machine dies, the state is preserved and any other machine can step in.

You also want to centralize data to minimize storage, reduce overhead, and make sharing possible. Having one central copy, whether virtual or logical, adds speed, and if that copy is distributed with failover capabilities, it manages the risk of a machine going down. One copy also means that you can quickly articulate and track any changes in state.
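A minimal sketch of the 1-to-many relationship (the repository paths, component names, and hosts below are invented for illustration) looks like this:

```python
# Illustrative sketch: repository paths, component names, and hosts are
# invented to show the 1-to-many mapping, not to describe a real product.

repository = {
    "rhel-base-4": "/shared/images/rhel-base-4",
    "httpd-2.0":   "/shared/stacks/httpd-2.0",
    "oracle-10g":  "/shared/stacks/oracle-10g",
}

machines = {
    "web01": ["rhel-base-4", "httpd-2.0"],    # references to shared copies,
    "web02": ["rhel-base-4", "httpd-2.0"],    # not per-machine duplicates
    "db01":  ["rhel-base-4", "oracle-10g"],
}

# One stored copy of each component serves every machine that references it.
users_of = {name: [m for m, refs in machines.items() if name in refs]
            for name in repository}
print(users_of["rhel-base-4"])    # ['web01', 'web02', 'db01']
```

Each machine’s private state shrinks to its list of references plus its own data, which is exactly what makes it cheap to articulate, compare, and move.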

6. Address Security and Compliance

Traditionally, security has been approached at the edge of the network. But when the front line of defense fails, there can be serious consequences to the enterprise. Security breaches cost over $7 billion in lost productivity every year, wreaking havoc in the data center, where short- and long-term problems can be difficult to diagnose.

With authoritative state articulation, however, it’s possible to precisely identify what’s changed and detail the nature and scope of the changes. Without wasting time or affecting service levels, you can halt further spread across 1,000-plus machines, and roll back to a secure state in minutes.
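As a sketch of what authoritative state articulation buys you in this situation (the snapshot() and changed_files() helpers are illustrative stand-ins, not a real intrusion-detection product), identifying what changed reduces to comparing a server's current contents against the recorded known-good state:

```python
# Illustrative sketch: snapshot() and changed_files() stand in for authoritative
# state articulation; they are not a real intrusion-detection product.

import hashlib

def _digest(path):
    """Checksum of a file's current contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def snapshot(paths):
    """Record the known-good state: a checksum for each watched file."""
    return {path: _digest(path) for path in paths}

def changed_files(recorded):
    """Return the files whose current contents no longer match the record."""
    drifted = []
    for path, expected in recorded.items():
        try:
            actual = _digest(path)
        except FileNotFoundError:
            actual = None
        if actual != expected:
            drifted.append(path)
    return drifted

# Capture the baseline after a known-good configuration, then compare later:
# baseline = snapshot(["/etc/ssh/sshd_config", "/etc/passwd"])
# ... a breach is suspected ...
# print(changed_files(baseline))
```

Run across every server, the same comparison scopes the breach; the recorded state is also what you roll back to, as described in the next step.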

The scaled-out data center needs to be able to rapidly deploy security patches and upgrades to its large number of servers and applications within a few hours, with the ability to safely and quickly undo a failed patch from production servers, without impacting mission-critical services. And more fundamentally, the data center must employ many layers of security and take substantial steps to protect the configuration files of all of the servers from unauthorized modification.

Security in the scaled-out data center encompasses the hardware, applications, configurations and environments, and should allow for reporting and auditing to meet current or future compliance requirements. For a number of compliance-related reasons, a data center is expected to preserve a service, its data, and its environment for up to 7 years, sometimes longer, with the ability to precisely and comprehensively restore all of it at any point during that time.

7. Enable Granular Roll-Back

As the data center is subject to perpetual change — through provisioning new hardware and software, through reconfigurations or administrative actions, or even accidents and malicious attacks — the ability to safely and quickly revert to a prior state is critical.

For example, consider the case of applying even one small patch in a large data center. Due to the sheer number of machines, the impact of that one patch, especially if it’s faulty, is immense.

Traditional backups are hardly a solution if no backup was made before the change was deployed. And recovering files from backups can be an adventure: restoration must begin from the last full backup, followed by sequential restoration of each incremental backup up to the most recent one, which may itself have been created some time before a series of changes were made. Disk images are similarly problematic: restoring an image to revert a change is far too coarse-grained to be practical. Undoing one change has the effect of undoing all changes that transpired between two snapshots, including those made by others who may be unaware of or unprepared for this “corrective” action.

The scaled-out data center needs to have a granular roll-back mechanism that not only allows for the examination of precise additions, deletions and modifications that have been made deliberately or inadvertently, but also allows each individual change to be selectively rolled back, singly or collectively with other changes, and not just within a single server, but also among a group of servers, or all servers in the data center.
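A minimal sketch of such a journal (the change records and the rollback() helper are hypothetical, shown only to make the granularity concrete):

```python
# Illustrative sketch: the journal entries and rollback() helper are
# hypothetical, shown only to make the granularity concrete.

journal = [
    {"id": 101, "host": "web01", "op": "update",
     "item": "openssl", "old": "0.9.7", "new": "0.9.8"},
    {"id": 102, "host": "web01", "op": "add",
     "item": "mod_security", "old": None, "new": "1.9"},
    {"id": 103, "host": "db01", "op": "update",
     "item": "kernel", "old": "2.6.9", "new": "2.6.11"},
]

def rollback(journal, change_ids):
    """Compute inverse operations for just the selected changes, newest first."""
    inverse = {"add": "remove", "remove": "add", "update": "update"}
    plan = []
    for change in reversed(journal):
        if change["id"] in change_ids:
            plan.append({"host": change["host"],
                         "op": inverse[change["op"]],
                         "item": change["item"],
                         "from": change["new"],
                         "to": change["old"]})
    return plan

# Undo only the faulty kernel patch on db01, leaving changes 101 and 102 intact.
print(rollback(journal, {103}))
```

Because each entry carries enough information to compute its inverse, change 103 can be undone on db01 alone, on a group of database servers, or everywhere, without disturbing changes 101 and 102.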

8. Make Disaster Recovery Part of the Plan Now

A typical data center is subject to periodic hardware failures. When the number of machines is small, it’s possible to maintain redundant units to take over when the primary unit fails. However, in a scaled-out data center with thousands of servers, it’s impractical and costly to have a standby server for each of the multitude of servers.

Today’s data center needs the same rapid failover capabilities as those provided by hot standby systems, without the cost and complexity. By discretely separating the server’s hardware from the contents of its storage, you need to keep only a small number of spare servers in reserve, to which a service can quickly be migrated when the original host fails. Such a scheme requires that the entirety of the system — the application or service, configurations, and all data — be swiftly migrated to another piece of hardware, which may or may not have components identical to the original, in a matter of minutes.
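The following sketch, with invented host and service names and a fail_over() helper standing in for the real migration machinery, shows why the separation matters: recovery becomes a change to the logical-to-physical mapping rather than a rebuild of the failed machine.

```python
# Illustrative sketch: host names, the service record, and fail_over() are
# invented stand-ins for the real migration machinery.

spares = ["spare01", "spare02"]

services = {
    "billing": {"host": "db07",
                "state": ["rhel-base-4", "oracle-10g", "billing-config"],
                "data": "/shared/volumes/billing"},
}

def fail_over(name):
    """Remap a service from its failed host to the first available spare."""
    if not spares:
        raise RuntimeError("no spare hardware available")
    service = services[name]
    failed, replacement = service["host"], spares.pop(0)
    service["host"] = replacement    # only the logical-to-physical mapping
    return failed, replacement       # changes; the state definition does not

# db07 dies; billing comes back on spare01 once it boots and attaches the
# shared volume.
print(fail_over("billing"))
```

Any spare of roughly suitable capacity will do, because the service’s definition, configuration, and data never lived solely on the failed host in the first place.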

Disaster recovery plans are critical with ten or with a thousand servers, but as the number of machines grows, there are significantly greater permutations and complexity. According to research from analyst firm Gartner, your disaster recovery plan must be capable of recovering the correct applications to a secure state, allowing the server to be recreated, with all changes, on a different platform or even in a different location, to continue providing service even as disaster strikes.

You Don’t Need 1,000 SAs to Manage 1,000 Servers

Today, 80 percent of the IT budget goes to simply “keeping the lights on,” ensuring that the data center is up and running and addressing unplanned outages, often when it may be too late. That leaves only 20 percent of the budget for exploring new IT projects that help lines of business, increase operational efficiency, or drive revenue. Get your expanding Linux servers under control, and it may be possible to reverse these figures.

This doesn’t mean the system administrator’s function becomes obsolete; far from it. Instead, a new approach to administration actually empowers the SA, the DBA, and the entire IT team to move beyond simply maintaining machines and day-to-day infrastructure, increasing productivity and contributing value more creatively in ways that directly affect new lines of business.

The trick is to combine everything into an on-demand configuration with full state capture, portability, and disaster recovery using commodity hardware. By thinking differently about administration and authoritatively articulating state, you are well on your way to a kind of data center “nirvana,” where you can manage 1,000 (or more) Linux servers as easily as a handful, with lots of extra time to brew yourself a strong cup of coffee.
