Enterprise-level clusters? It's time we start thinking about them.
The Cry from the Wilderness
For whatever reason, I’ve lately become sensitive to getting clusters considered “Enterprise-ready.” I’ve been talking to various people who either have clusters that are managed outside the mainstream Enterprise or who are new to clusters and want to find a way to get the clusters into the IT shop (so they don’t have to manage them). But a common theme from both sides is that Cluster Tools are missing what might be termed “Enterprise Class” tools.
So how do we make existing Cluster tools into Enterprise-Class Cluster Tools? One approach is to add the pieces that are missing. I think some pieces can easily be added to existing tools, and in other cases better installation and management tools would make them more Enterprise ready. However, I think there are also some tools that need to be developed. Regardless, let’s look at some things that are needed to move Cluster Tools into the Enterprise realm.
Bare Metal Backup
One of the key components that Enterprise IT groups require in servers is a bare metal backup. This can be thought of as a Disaster Recovery (DR) requirement. The idea is that if the server goes down and either a new OS drive or a completely new server is required, you can quickly restore it to the originally configured state without having to work through the installation, configuration, etc. that you originally had to do. It’s fairly easy to see the usefulness of a bare metal backup.
A bare metal backup is a bit different than an image. With an image you will need the same drive(s) as you had before. With a backup you don’t necessarily have to have the same drives as before. They have to be at least as large as the originals, but they do not have to be the same type or exactly the same capacity. However, to make life easier it’s probably a good idea to have the drive(s) be the same type as before.
A bare metal backup for a cluster only applies to the critical components of a cluster. This includes the head node, any login nodes, IO nodes, and possibly storage nodes (for example, IO nodes for a parallel file system such as Lustre, GPFS, or GFS). Basically you create bare metal backups for anything but the compute nodes. Typically people think of the compute nodes as disposable. That is, if a compute node fails you just replace it and restart the job that was using the node. This also means that you don’t place anything critical on the node. For example, if you put the OS or a subset of the OS in a ramdisk on the compute node, then any hard drive in the node holds nothing critical and can be used for local storage or for swap space.
Using bare metal backups in practice is fairly easy but, as with all backups, requires some discipline. When a cluster is first configured and tested, but perhaps before it goes into full production, the first bare metal backup should be made of the head node, the login nodes, and possibly the storage nodes (not the storage itself but the OS on the storage nodes). Then, if any configuration changes are made to any of these nodes, a new bare metal backup should be made of the entire set of nodes. These backups should be kept in a safe place and, even better, extra copies should be made and kept off-site (a simple safety deposit box can be used for this).
To make the process of creating bare metal backups easier it should be integrated with the cluster management tool. Then when a configuration change is made a bare metal backup can be created. This is also true for any firmware or BIOS upgrades. However, you will need to keep track of the current firmware versions so any replacement hardware is configured with identical firmware versions. In addition, if the image for the nodes is changed, the cluster management tool can prompt the administrator to make the backup.
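As a rough illustration of how a management tool might decide when to prompt for a new backup, here is a minimal Python sketch. It assumes the tool can list the configuration files that matter on a node and can obtain the node's firmware versions as a simple dictionary. The function names, the state file, and the file layout are all hypothetical and are not taken from any existing cluster tool.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths):
    """Return one SHA-256 digest covering the contents of all given files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def _load_state(state_file):
    """Read the recorded per-node state, or return an empty record."""
    state_path = Path(state_file)
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def backup_needed(node, paths, firmware, state_file):
    """Compare a node's current state with the one recorded at its last backup.

    Returns (needed, current) where `current` is the state to record once
    the new bare metal backup has completed.
    """
    current = {"config": fingerprint(paths), "firmware": firmware}
    return _load_state(state_file).get(node) != current, current


def record_backup(node, current, state_file):
    """Store the node's state after a bare metal backup has completed."""
    state = _load_state(state_file)
    state[node] = current
    Path(state_file).write_text(json.dumps(state, indent=2))
```

A tool could call `backup_needed` after every configuration or firmware change and prompt the administrator when it returns `True`. The same state file doubles as the record of firmware versions that replacement hardware would need to match.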
One of my favorite tools for creating a bare metal backup is a tool called Mondorescue (www.mondorescue.org). It’s a GPL licensed bare metal backup tool that is also scriptable. It has a variety of ways to store the backups, including NFS, and is very easy to use.
Another enterprise class tool that is absent from virtually every cluster that I know of is a reporting tool. A reporting tool takes information about the system and creates a summary report. This report is very useful for getting an idea, at a reasonably high level but with some detail, of how the system was used. The report can be a file, delivered via the web, or embedded in applications. It can include tables (including a spreadsheet) or graphs.
These reports can be used for visualization and reporting on the cluster. For example, a report can contain information such as the number of jobs run in a certain time period, total uptime of the nodes, number of nodes down, the total down time for the cluster (sum of the down time of the nodes that were down), file space usage, login times, administration time, any security information on the head node, such as failed login attempts, any network traffic information you have gathered, and just about anything else you can measure or are interested in seeing in a report.
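To make the idea a little more concrete, here is a minimal Python sketch of a few of the figures listed above being rolled up into a plain-text summary. It assumes the raw data (jobs per user, node outages, failed logins) has already been collected from the scheduler and system logs. The `NodeOutage` record and the `summarize` and `format_report` functions are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeOutage:
    node: str          # node name
    hours_down: float  # length of this outage in hours


def summarize(jobs_by_user, outages, failed_logins, period_hours, node_count):
    """Roll raw cluster data for one reporting period into summary figures."""
    total_down = sum(o.hours_down for o in outages)
    node_hours = period_hours * node_count
    return {
        "jobs_run": sum(jobs_by_user.values()),
        # A node with several outages still counts as one node down.
        "nodes_down": len({o.node for o in outages}),
        # Cluster down time is the sum of the down time of the affected nodes.
        "total_down_hours": total_down,
        "total_up_hours": node_hours - total_down,
        "failed_logins": failed_logins,
    }


def format_report(summary):
    """Render the summary as an aligned two-column text table."""
    width = max(len(name) for name in summary)
    return "\n".join(f"{name.ljust(width)}  {value}" for name, value in summary.items())
```

The dictionary returned by `summarize` could just as easily be written out as a spreadsheet row or fed to a plotting library. The text table is simply the smallest thing that shows the shape of a report.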
I’m sure there are some people who are reading this column and asking if I’m on crack. Well the obvious answer is no and it’s not the coffee either. Let me explain why I think reports are a very useful and really a required feature for enterprise class clusters.
I think clusters are cool in and of themselves, but they are also very useful for solving large scale problems. Many times, people get a cluster up and running, declare victory, and start using it. However, as they will find out, getting it up and running is only half the problem. The other half is keeping the system running. Simple day to day activities on the cluster such as watching the queues, killing hung jobs, starting jobs that need to be started, juggling project priorities, adding users, removing users, recovering or restoring data, debugging nodes that are down, bringing up repaired nodes, restarting hung nodes, and fixing user problems (“why won’t my script work?”) can take a great deal of time, particularly when you have a number of users with varied experience running on HPC class systems. Good administrators will fix these problems, but the better administrators will try to determine why they happened in the first place and how to make sure they don’t happen again. The best administrators won’t necessarily throw up barriers to prevent users from doing things but will give users better tools and better education to help them solve their problems.
If an administrator had a way to summarize activity on the cluster and correlate that with problems, it would be much easier to see cause and effect and take corrective action. I know this sounds awfully “enterprise” like and some people are cringing and wanting to strangle me, but the fact of the matter is that having this information can make it easier to understand what is happening in the cluster and fix the problem. The enterprise world uses the phrase “business intelligence.” While I’m not hot on that phrase (it does have an oxymoron ring to it), it does hint at what you’re trying to do: use the resources more efficiently and make your cluster run more smoothly.
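One simple way to correlate activity with problems is to line up two daily series from the reports, say jobs submitted and hung jobs killed, and compute a correlation coefficient. The Python sketch below uses the standard Pearson formula. The function name and the choice of those two series are assumptions made for illustration, not a feature of any existing cluster tool.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 for, say, `pearson(daily_jobs, daily_hung_jobs)` would suggest that hung jobs track load and point toward a capacity or scheduling issue, while a value near 0 would suggest looking elsewhere. A correlation alone does not establish cause, so it is a starting point for investigation rather than a diagnosis.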