Monitoring the status of a Beowulf-style cluster can be a daunting task for any system administrator, especially if the cluster consists of more than a dozen nodes. While Linux is extremely stable, hardware problems can cause nodes to crash or become inaccessible, and chasing down problem nodes in a 500-node cluster is painful. Luckily, statistical resource monitoring can often give early warning of impending hardware failures.
Perhaps even more important than collecting long-term statistics is real-time resource utilization data, which both system administrators and cluster users need. If jobs are queued and waiting to run while half the machine is idle, you may have a problem, and a system administrator should be able to spot that condition quickly.
Utilization information is valuable to cluster users as well. A user who needs to run a job immediately, on as many processors as possible and on nodes that meet specific memory requirements, should be able to work out a runtime configuration from present resource availability.
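To make the idea concrete, here is a minimal sketch of how a user might turn a snapshot of current utilization into a runtime configuration. Everything in it is an assumption made for illustration: the `NodeStatus` record, the `pick_nodes` helper, and the sample figures are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """One node's current state (hypothetical record, for illustration only)."""
    name: str
    idle_cpus: int
    free_mem_mb: int


def pick_nodes(snapshot, min_free_mem_mb, cpus_wanted):
    """Return (node name, CPUs to use) pairs, up to cpus_wanted in total.

    Only nodes with at least min_free_mem_mb of free memory and at least
    one idle CPU are considered; nodes with the most idle CPUs come first.
    """
    eligible = [
        n for n in snapshot
        if n.free_mem_mb >= min_free_mem_mb and n.idle_cpus > 0
    ]
    eligible.sort(key=lambda n: n.idle_cpus, reverse=True)

    plan = []
    remaining = cpus_wanted
    for node in eligible:
        if remaining <= 0:
            break
        take = min(node.idle_cpus, remaining)
        plan.append((node.name, take))
        remaining -= take
    return plan


# Made-up snapshot of a four-node cluster.
snapshot = [
    NodeStatus("node01", idle_cpus=2, free_mem_mb=512),
    NodeStatus("node02", idle_cpus=0, free_mem_mb=1024),
    NodeStatus("node03", idle_cpus=1, free_mem_mb=2048),
    NodeStatus("node04", idle_cpus=2, free_mem_mb=256),
]

plan = pick_nodes(snapshot, min_free_mem_mb=512, cpus_wanted=4)
print(plan)
```

With this snapshot, only node01 and node03 satisfy the memory requirement and have idle processors, so the plan covers three of the four requested CPUs. A real tool would also have to cope with utilization changing between the snapshot and job launch.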
Ideally, what’s needed is a real-time system monitoring tool that can handle a large number of nodes, can store utilization data for long-term analyses, can present this information in an easy-to-read graphical form, and can do this without noticeably consuming precious CPU or network resources.
Enter Ganglia, a scalable, distributed monitoring package for high performance computing systems. Not only can Ganglia monitor a compute cluster (up to 2,000 nodes!), but it can provide…