
Lessons Learned, Again

Cluster training for both student and teacher. Plus, re-learning a big lesson the hard way.

Over the years I have made my share of cluster mistakes. Each problem presented an opportunity to learn something new, become a little smarter, and get some scar tissue, as it were. I had just such an opportunity this week as I was teaching Intermediate Beowulf: An Introduction to Benchmarking and Tuning as part of the ARC HPC Training at Georgetown University. I'll get back to my teaching experience in a minute, but first I wanted to talk about HPC education.

In one sense, discussing the ARC HPC Training could be considered a shameless plug to boost enrollment in the class I teach. (It is the best class ever — you really should sign up for the next one, honest.) However, the ARC Training is a much needed addition to the HPC market and community. When Jess Cannata and Arnie Miles first discussed teaching the course with me, they explained that there would be a whole series of courses, starting with the basics and then focusing on the important areas of cluster administration and use. Indeed, in addition to the course I teach, they offer Beowulf Design, Planning, Building, and Administering (September 16-19, 2008 and January 26-29, 2009), Grid Engine Configuration and Administration, Intermediate Sun Grid Engine Configuration (October 21-23, 2008), and Introduction to Programming Accelerators and Coprocessors (March 9-11, 2009). Other courses are in the works as well. Having experienced the efforts of the ARC HPC team first hand, I can attest to the commitment and dedication of all those involved. Of course, there are other HPC training efforts, and we can still use more. HPC is moving faster than ever, and education is one of the gatekeepers to market growth. (Did you get that, vendors?) Speaking of education, that's a good segue to my recent learning experience.

The course I teach is designed for system administrators or talented end users who want to go a little deeper into cluster performance and tuning. One of the things I try to emphasize is that HPC pushes and uses hardware differently than most other markets. For example, consider a GigE switch. They all do the same thing, right? You just plug in the cables and everything just works. That is, unless you bought the switch that makes a carrier pigeon network look fast when all the ports are running at full speed. I often make the statement, "You can't build a cluster by consulting those glossy data sheets. The devil, as they say, is in the details. Assumptions can ruin your day and waste time and money." Great advice; too bad I ignored most of it this past week.

In the course, we use "mini-clusters" that consist of a head node and one worker node. Each node has an on-board NIC, an InfiniBand or Myrinet/10G card, an AMD dual-core processor, memory, and a hard drive. The head node also has a second NIC for LAN access. Basic stuff, really. This class was the third time I was teaching the course and the second time we were using this hardware. In terms of software, I decided to use my Fedora-based Cluster RPMs. The packages include things like LAM/MPI, MPICH1 and MPICH2, Open MPI, the NAS Parallel Benchmark Suite, and so on. My previous version was based on Fedora 6, and I used Warewulf (now Perceus) to provision the nodes. In preparing for the class, I thought I would use my new, but untested, Fedora 8 version. In particular, I knew the nodes would not boot, but it seemed similar to a problem I had fixed before with kernel ramdisk block sizes. No worries, as I jumped in the car and headed to Georgetown.

Shortly after I arrived, Jess and I were doing our standard "night-before-the-class, get-everything-working" routine. Admittedly, it is usually a long night, but we have always had the hardware working when the class started the next day. As we started to build the mini-clusters, Jess mentioned that he had purchased two Intel GigE NICs for each node so we could play with some of the driver parameters and see how they compared to the on-board NVIDIA NICs. Simple enough; we'll just set the node Intel NICs to PXE boot and be on our way. As the nodes booted, we waited for the Intel BIOS message to appear so we could set PXE booting. It never showed up. Checking the BIOS boot options, there was no indication the card was even present. Darn. Oh well, let's soldier on, use the Intel NIC for the LAN, and go back to using the on-board NVIDIA NIC for the mini-cluster. Great plan, except the Intel NIC would not work. It showed link connectivity to the switch, but ethtool told us "Link detected: no". Odd, the cards should work; I had used them before in other hardware with the exact same software. Not to worry, we just popped in some good old 100BT NICs and were up and running.
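
For the curious, the kind of check we were doing by hand is easy to script. Here is a rough sketch (in Python, purely for illustration) that walks the interfaces listed under /sys/class/net and asks ethtool whether a link is actually detected. Interface names, and whether ethtool needs root to report link status, will vary with your hardware and drivers.

    import os
    import subprocess

    def link_detected(iface):
        # Parse ethtool's "Link detected:" line; return True/False, or None if unknown.
        try:
            out = subprocess.check_output(["ethtool", iface],
                                          stderr=subprocess.STDOUT).decode()
        except (OSError, subprocess.CalledProcessError):
            return None
        for line in out.splitlines():
            if "Link detected:" in line:
                return line.split(":", 1)[1].strip() == "yes"
        return None

    for iface in sorted(os.listdir("/sys/class/net")):
        if iface == "lo":
            continue                  # skip loopback
        status = link_detected(iface)
        label = {True: "link up", False: "NO LINK", None: "unknown"}[status]
        print("%-8s %s" % (iface, label))

Nothing fancy, but running something like this on every node would have told us right away which cards the kernel could actually drive, rather than trusting the blinking lights on the switch.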

Time to boot the nodes. No joy. The little boot problem I assumed would be fixed in about ten minutes turned into a big problem. After a few yawns, we decided to just drop back to the Fedora 6 software we used last time. It was very late, and we could work with the head nodes on the first day in any case and have the clusters ready on the second day.

The old software worked as expected. Well, almost. As part of the class we compare GigE to InfiniBand and Myricom 10G. We also run some tests using Open MPI so we can see the differences due to the hardware. Everything worked in the previous class, so of course it should work this time. It didn't. The Myrinet software installed and worked fine, except for a random mpirun issue. The InfiniBand software installed fine and the diagnostics worked, but the MPI programs got stuck. We were not sure what happened, and as far as we could tell, everything was the same as last time.
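
If you want to try the same kind of comparison on your own mini-cluster, the sketch below shows the general idea: run one benchmark binary while switching the Open MPI BTL between tcp (GigE), openib (InfiniBand), and mx (Myrinet/10G). It is Python purely for illustration; the host names, process count, and benchmark binary are placeholders for whatever you have built, and the BTL names assume an Open MPI installation with those components compiled in.

    import subprocess

    HOSTS = "head,node0"             # hypothetical host names for the mini-cluster
    BENCH = "./cg.B.2"               # placeholder: any MPI benchmark binary you have built
    CASES = [                        # label, Open MPI BTL selection
        ("GigE (tcp)",           ["--mca", "btl", "tcp,self"]),
        ("InfiniBand (openib)",  ["--mca", "btl", "openib,self"]),
        ("Myrinet/10G (mx)",     ["--mca", "btl", "mx,self"]),
    ]

    for label, btl in CASES:
        cmd = ["mpirun", "-np", "2", "--host", HOSTS] + btl + [BENCH]
        print("=== %s ===" % label)
        print(" ".join(cmd))
        subprocess.call(cmd)         # benchmark output goes to stdout; compare times by hand

The point of pinning the BTL explicitly is that the only thing changing between runs is the interconnect, so any difference in the numbers is due to the hardware and its driver stack, not the MPI library.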

Even with these problems, we were still able to teach a successful course (we had the test data from the previous course). In addition, we managed to demonstrate one valuable lesson: Assumptions can ruin your day and waste time and money. One could conclude that we planned to demonstrate this lesson as part of the course, thus enlightening the students with true experiential learning. Or, on the other hand, one could conclude that the bonehead instructor did not listen to his own advice and assumed things should just work because they did before. I'll leave the final determination as an exercise for the student. In the meantime, I'll be testing a few assumptions over the next few days. The next time the course is offered, we'll present a case study on the importance of eliminating the words "should work" from your vocabulary. And remember, for those that missed it the first time (and second time): Assumptions can ruin your day and waste time and money. In my case, maybe a tattoo would help.

Comments on "Lessons Learned, Again"

notinhnotien7

Thanks for sharing your experiences. My questions to you and the ARC HPC Training team are the following:

1. Are you and others going to write books about all the topics taught at the training for those who simply could not attend?

2. If there are no books from you and others, could you recommend a systematic process for acquiring all the knowledge needed to stay on top of administering and designing HPC systems?

Thank you.

pbock

I second the requests from the message posted by notinhnotien7.

Unfortunately, from what I can see, the list of needed knowledge is simply '*everything*' (the *'s are to include anything I left out with the word 'everything').

Pretty much everything from assembly-level programming in the BIOS all the way up to network administration, with all the little nitty-gritty details in between.

You could maybe get a simple cluster working without all that knowledge, but you’ll have to learn all that stuff anyway when it breaks, or you switch out one piece of hardware like the article mentions, or you change something in the software stack, or click the left mouse button with the wrong finger, etc.

You also need to be able to test every component; is the problem with node 6 a bad NIC, or a BIOS setting, or did the penguin|cat|dog chew on your ethernet cable?

doubtful500

Response to notinhnotien7:

We think that there should be more publicly available information about building and running clusters, which is why all of the student material is available via our Wiki. We require all of our instructors to allow us to post their material, and the instructors–and the organizations for which they work–have graciously complied.

You can find the material for the different courses at https://www.middleware.georgetown.edu/confluence/display/HPCT/Home.

Response to pbock:

Yes, as a builder and maintainer of clusters one is required to know something of everything. Fortunately, things are getting easier on the administration side in part because there are now many hardware companies that will do the testing to make sure those NICs work with the motherboards on Linux.

Jess Cannata
Advanced Research Computing &
High Performance Computing Training
Georgetown University
jac67@georgetown.edu


