Enterprise-level clusters? It's time we start thinking about them.
I wasn’t a believer in reports at first, but once I saw what kind of reports could be produced, I became a convert. While they may not be appropriate for small systems with just a couple of users, they are extremely useful for larger systems with many users. Even better, you can use these reports to show your management where the bottlenecks in the cluster are, how much administration time is being spent on certain tasks (e.g. killing hung jobs), how overloaded the system is, and how reliable the cluster is (or isn’t). With reports in your hands you have firm evidence to justify your arguments for that shiny new, larger, and faster cluster. Furthermore, you can post the reports to the web to show users how the system is being used, so they understand the need to be patient (in some cases) because they can easily see that the number of jobs waiting in the queue is huge. This also gives them the opportunity to complain to their management, which may help free up some funding for new systems.
The enterprise class folks have had a robust reporting capability for many years. The preeminent enterprise report tool that I know of is Crystal Reports. It’s widely used for report generation. I suggest that you take a look at what it can do as well as examples of reports. I’m not sure if Crystal Reports can work with clusters but the examples will give you an idea of what’s possible.
There are also open-source reporting tools such as DataVision (http://datavision.sourceforge.net/). I’m not sure how DataVision or others compare to Crystal Reports, but I expect they have some reasonably good basic reporting capability.
Even more important than which reporting tool is used is that it is integrated with the cluster management tool. That way it can pull data from the cluster tool, assuming the tool is monitoring the cluster and the job scheduler. In addition, a default report should be included so administrators don’t have to create something from scratch.
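To make the idea of a default report concrete, here is a minimal sketch in Python of the kind of summary a cluster tool could generate from job scheduler data. The record format, field names, and sample values are assumptions made up for illustration; a real scheduler's accounting log will look different and would need its own parser.

```python
from collections import defaultdict

# Hypothetical job records, roughly what a scheduler's accounting log might
# export. Field names and values are illustrative assumptions only.
JOBS = [
    {"user": "alice", "wait_s": 120, "run_s": 3600, "state": "completed"},
    {"user": "alice", "wait_s": 900, "run_s": 0, "state": "killed"},
    {"user": "bob", "wait_s": 4000, "run_s": 7200, "state": "completed"},
]


def summarize(jobs):
    """Aggregate per-user job count, killed jobs, average wait, and run hours."""
    totals = defaultdict(lambda: {"jobs": 0, "killed": 0, "wait_s": 0, "run_s": 0})
    for job in jobs:
        row = totals[job["user"]]
        row["jobs"] += 1
        row["killed"] += int(job["state"] == "killed")
        row["wait_s"] += job["wait_s"]
        row["run_s"] += job["run_s"]

    report = {}
    for user, row in totals.items():
        report[user] = {
            "jobs": row["jobs"],
            "killed": row["killed"],
            "avg_wait_min": row["wait_s"] / row["jobs"] / 60,
            "run_hours": row["run_s"] / 3600,
        }
    return report


def format_report(report):
    """Render the summary as a plain-text table, one row per user."""
    lines = [f"{'user':<10}{'jobs':>6}{'killed':>8}{'avg wait (min)':>16}{'run (h)':>10}"]
    for user in sorted(report):
        r = report[user]
        lines.append(
            f"{user:<10}{r['jobs']:>6}{r['killed']:>8}"
            f"{r['avg_wait_min']:>16.1f}{r['run_hours']:>10.1f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(summarize(JOBS)))
```

Even a table this simple covers several of the points above: who is waiting longest, how many jobs had to be killed, and how heavily each user is loading the system.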
Directory Services
After the cluster has been brought up, it’s time to create accounts for users and have them start running jobs. One of the key decisions is how to create those user accounts. There are three ways to do this:
- Create accounts on the head node that are local to the cluster and then copy the password files to the compute nodes (this approach is sometimes called “files”)
- Use NIS, NIS+, LDAP, or even Active Directory within the cluster but keep the accounts local to the cluster
- Use NIS, NIS+, LDAP, or Active Directory where the accounts are defined outside of the cluster and use the same account definitions within the cluster
Each of these approaches has its pluses and minuses, and there is lots of debate about which is best (I have my own opinion but that’s another column). But in many cases there is a need to use one particular approach because of IT requirements (yes, I hear the grumbling). While it is sometimes a pain to accommodate IT’s requirements, in many cases it has to be done. So the question then becomes configuring the head node for NIS/NIS+, LDAP, or Active Directory, and then configuring the compute nodes to contact the head node for directory services.
For a cluster, it is unlikely that you will use anything from directory services other than user accounts. So it would be nice if, during installation of the head node, the administrator could simply choose an option (i.e. “files”, NIS, NIS+, LDAP, Active Directory) and be walked through connecting the head node to the desired directory service. Then, as part of creating the node image (if you use images) or the set of packages to be installed on the compute nodes, you should also be able to choose a directory service.
Given all of this, what an enterprise class cluster tool set needs is an easy way to have the head node integrate with the desired directory service, and to have the compute nodes either use the same service or just “files” for authentication. Ideally this would be part of the installation of the OS on the head node or part of the cluster tool installation. Plus, the initial compute node image or set of packages would have the appropriate directory service installed. I know this may be a bit of heresy, but I think it’s important for clusters to be considered enterprise class.
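As a rough sketch of what such an installer step might do behind the scenes, the Python below maps the administrator's choice to the passwd and group lines of /etc/nsswitch.conf. The source names (nis, nisplus, ldap, winbind) are the standard name service switch names, but the choice labels and the helper itself are hypothetical. Using Winbind for Active Directory is just one possible option, and the sketch assumes the matching NSS module is already installed and configured on the node.

```python
# Map an installer choice to the NSS source listed after "files".
# The menu labels on the left are hypothetical; the source names on the
# right are the usual nsswitch.conf names for each service.
NSS_SOURCE = {
    "files": None,                  # local password files only
    "nis": "nis",
    "nis+": "nisplus",
    "ldap": "ldap",
    "active directory": "winbind",  # assumption: AD reached through Samba's Winbind
}


def nsswitch_lines(choice):
    """Return the passwd and group lines for the chosen directory service.

    Local files are always consulted first so that root and system
    accounts still resolve if the directory service is unreachable.
    Other databases (shadow, hosts, ...) are deliberately left alone.
    """
    key = choice.strip().lower()
    if key not in NSS_SOURCE:
        raise ValueError(f"unknown directory service: {choice!r}")
    source = NSS_SOURCE[key]
    sources = "files" if source is None else f"files {source}"
    return [f"{db}: {sources}" for db in ("passwd", "group")]


if __name__ == "__main__":
    for option in ("files", "NIS", "NIS+", "LDAP", "Active Directory"):
        print(option)
        for line in nsswitch_lines(option):
            print("   ", line)
```

The same function could be called once for the head node and once for the compute node image, which is how a tool could let the compute nodes use either the same service or plain “files”.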
NIS, NIS+, and LDAP all come with virtually any Linux distribution you might install. Active Directory is a bit more difficult. There are commercial offerings such as those from Likewise Software or Quest Software’s Vintela package. I have also seen articles on using Winbind (which comes with Samba) to authenticate against Active Directory. However, I’m not sure it’s easy.
While I still think cluster tools need to be “upgraded” to configure user authentication against a variety of directory services, I think connecting a Linux cluster to Active Directory is perhaps not the easiest or the smartest thing to do. However, I think we as a community should start to figure out how to do this, since it is coming (beware the Enterprise Standards).
Lots of Other Things
I have several other ideas for things that could help turn clusters into enterprise class systems. For example, having an image that could be used to benchmark a node would be very useful. It would allow the administrator to periodically test the compute nodes to make sure that performance is not degrading, particularly after upgrades. It can also be used to check the performance across the compute nodes to look for variance in performance.
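The variance check could be as simple as comparing each node's benchmark score to the cluster median. The Python sketch below assumes a higher score is better; the node names, scores, and the 10% tolerance are made-up values for illustration, and the right threshold will depend on the benchmark and the hardware.

```python
from statistics import median


def flag_slow_nodes(scores, tolerance=0.10):
    """Return the nodes whose score is more than `tolerance` below the median.

    `scores` maps node name -> benchmark result (higher is better).
    The median is used rather than the mean so that one badly
    misbehaving node does not drag the reference value down.
    """
    if not scores:
        return []
    cutoff = median(scores.values()) * (1.0 - tolerance)
    return sorted(node for node, score in scores.items() if score < cutoff)


# Hypothetical results from running the same benchmark image on four nodes.
SCORES = {"node01": 98.2, "node02": 101.5, "node03": 84.0, "node04": 100.1}

if __name__ == "__main__":
    print(flag_slow_nodes(SCORES))
```

Keeping the scores from each run also gives a baseline, so the same comparison can be made for a single node before and after an upgrade.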
Forward to the Future
I’m likely to get a few emails about this column. It is somewhat controversial to talk about enterprise level systems in the cluster community. To some degree I agree with this sentiment because clusters may need to do things in an unconventional manner for the best performance and clusters are about performance. But at the same time, I see the need to bring clusters as best we can into the enterprise. It can lower the barrier to HPC for customers who have no experience with it and can possibly help reduce costs associated with clusters. So I think it’s time we start thinking about these things and working on them as quickly as we can.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).
Comments on "Cluster Tools for the Enterprise"
Surprised you didn’t touch on the Ganglia open source project – used for measuring, monitoring, maintaining, and optimizing grid and cluster HPC environments. Beyond academia and research, there are now major enterprise HPC environments being managed using open platforms and open standards like Ganglia.
What is your take on integrating with existing enterprise monitoring frameworks through SNMP? I am not very familiar with these, but HP OpenView or BMC Patrol come to mind.
Why doesn’t Dell adopt Open Solaris and dump that linux and M$ junk.
Sun has had a very nice suite of clustering software for years, now they are pseudo-open sourcing it. Also for old schooler clustering, don’t forget Beowulf.
Insider28 is right about Ganglia. arenddittmer must have meant BMC Performance Manager: a good product, it supports Veritas Cluster, SUN and M$ clusters.
I split monitoring into three categories. Each one requires a separate tool(s):
1) user/job monitoring
- who is using the cluster ? how do they use it ? who are the cluster “hogs” ? is your cluster efficiently allocating resources ?
2) performance monitoring
- Is your network congested ? does your storage backend have evenly distributed load ?
3) hardware monitoring
- raid array degraded ? did a network switch just reboot ? bad memory on a cluster node ?
#1 should be the job of the cluster scheduling/grid engine software
#2 would be something like Ganglia, as previously mentioned.
#3 would be an SNMP-aware tool: Cacti, Nagios, OpenNMS, etc. Many hardware vendors have Linux-aware products in this category (e.g. Dell, IBM, HP).
We started with #2, but discovered it wasn’t answering many of these questions.