Cluster Monitoring with Ganglia

Monitoring the status of a Beowulf-style cluster can be a daunting task for any system administrator, especially if the cluster consists of more than a dozen nodes. While Linux is extremely stable, hardware problems can cause nodes to crash or become inaccessible, and chasing down problem nodes in a 500-node cluster is painful. Luckily, some sort of statistical resource monitoring can often yield early warnings of impending hardware failures.

Perhaps even more important than collecting long-term statistics is the need, shared by system administrators and cluster users alike, for real-time resource utilization data. If jobs are queued and waiting to run while half the machine is idle, you may have a problem. A system administrator should be able to spot this sort of condition quickly.

Utilization information is valuable to cluster users as well. A user who needs to run a job immediately, on as many processors as possible and on nodes that meet specific memory requirements, should be able to determine a runtime configuration based on present resource availability.

Ideally, what’s needed is a real-time system monitoring tool that can handle a large number of nodes, can store utilization data for long-term analyses, can present this information in an easy-to-read graphical form, and can do this without noticeably consuming precious CPU or network resources.

Enter Ganglia, a scalable, distributed monitoring package for high performance computing systems. Not only can Ganglia monitor a compute cluster (up to 2,000 nodes!), but it can provide…
