Cluster administration is possible, and even easy, if you focus on the basics.
Recently, I came across an article entitled “Lazy Linux: 11 secrets for lazy cluster admins.” Most cluster admins who read it will probably chuckle and nod at its insights. There is nothing like experience to help manage a Smoothly Operating Beowulf (SOB; what did you think it meant?).
One of the often-cited difficulties in this market is finding experienced cluster administrators. Indeed, this is considered a “hold back” to fielding a production cluster. It has always been my position that clusters are different, but that administering them is not that different from administering a common Linux server.
Indeed, much of cluster administration is just really good systems administration; that is, you need a very good understanding of certain aspects of server administration. Most cluster administrators have some “carry over” from other areas of computing. Each of these areas has some level of educational infrastructure (manuals, mailing lists, freely available software, and even courses) that can be leveraged by those wanting to learn about clusters. The following is a list of topics that, in my experience, are needed to become a Lazy SOB Administrator. Surprisingly, resources can be found to support almost all of these areas.
Message Passing Interface (MPI): This topic is often new to administrators, but because MPI has been around since before clusters hit the big time, there are numerous books and classes that facilitate learning MPI.
Compilers: Most cluster experts have a good understanding of compilers and building code. Understanding that a long stream of error messages can be due to a single missing library (and is easily fixed) prevents the sense of being overwhelmed that comes with trying to build a new software package in your environment.
Operating System Administration: Opportunities to learn about operating systems are plentiful. Three-inch-thick books are in good supply, as are certification classes and training. Scaling good administration is addressed in the above-mentioned “Lazy Linux: 11 secrets for lazy cluster admins.”
Commodity Hardware: Most clusters use off-the-shelf hardware. Resources for understanding commodity hardware are also plentiful, although nothing works like having a motherboard or two with which to test ideas.
Schedulers: Resource scheduling has been around ever since people started sharing computers. There are resources to help learn about schedulers and, like most things, a little hands-on time does wonders.
Networking: Networking is perhaps the toughest area in which to find good information, even in cluster courses. In many other settings, non-optimal network performance works quite well for just browsing the web or transferring a file. Although much of Linux networking is plug-and-play, there is room for optimization when it comes to clusters. High-end interconnect networks have in the past been even more obscure. Fortunately, the market seems to be focusing on either 10 GigE or InfiniBand solutions, and many of the high-end network companies are moving in this direction as well.
Parallel Computing: This topic is perhaps the least understood of all the topics. It has been studied for quite a while, and there is still no consensus on the best way to use multiple processors. For system administrators, parallel computing is often about removing bottlenecks that slow down program execution. These bottlenecks can involve one or many of the above topic areas. The multifaceted nature of these issues is why solving cluster problems requires good administrative practices. In my experience, you can’t point and click your way through cluster optimizations.
In summary, the bulk of the knowledge needed for good administration is “already out there.” Clusters use this knowledge in a very focused way, and experience is the best teacher. Second to that is an open infrastructure that facilitates open discussion, co-development, and open problem solving.
I could write more on this topic, as it also raises the education issue. But we will have to wait, because the SC08 bus is almost here. For those who missed last week’s column, don’t forget to come by the Beowulf Bash Party on Monday night. In addition, I just got word that the camera guy will be with me again this year. In case you did not see us last year, able cameraman Vincent Hong followed me around the show floor while I interviewed participants and vendors alike. Amazingly, the “give Doug a mike and follow him with a camera” idea seemed to work. We are still working on getting some of this video posted, but expect to see a lot more video on the site real soon. They tell me I have a non-standard approach, but I generally get the point across. This year, if you see me, run, and don’t forget to smile.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.