Here is your challenge. You have a need for speed, but your current computing power is insufficient for the task at hand. You have some large number of calculations to perform, and very little time to achieve this goal. Can you solve this problem? Cost effectively? Quickly?
Get It Done… Yesterday
Getting a cluster up and running quickly is possible if the right steps are taken. As this article will show, it is possible to bring a real cluster up in less than 60 minutes! Before we begin, though, it is important to understand that component selection, network, storage, and node architecture are complex problems beyond the scope of this article. We assume you have your cluster hardware in place ready to run — except for the software.
A good HPC solutions company can help with the entire process, from design to acquisition and support, but chances are it will not install and support the software stack that fits your needs. Let’s assume you have worked with a vendor and have your cluster engineered to fit the problems. It is likely to be a 64 bit system based upon x86_64. Let’s also assume that the cluster has been physically assembled, or racked-and-stacked, if you prefer.
Now you have this system in house and you have to get it up and operational for users to start crunching. How can you do this quickly and efficiently? What problems are you likely to encounter, and how can you solve them? How soon will the users get access to the machine? Most want it yesterday. Let’s assume you are under the gun to get this done. Basically, how can you be up and running quickly?
Our goal is 60 minutes from bare metal to loading nodes and even running jobs. Before we get into this, please understand that there are no silver bullets, magic distributions, or anything like that. Pre-planning is important, as there are multiple gotchas that often trap the unwary and occasionally the experienced. Finally, there are no “instant clusters, just add nodes” systems for you to work with. Some groups are working on such things, so if you pay attention you will likely see announcements some time in the future. For now, you still have to build it — and that means planning and executing the plan. Time is a-wasting, so let’s get started.
We Need A Plan
A plan for making the system available to users involves some choices and some information you must gather. To meet the goal, total planning and decision timing should be on the order of minutes for this system. The information we need to gather is as follows:
- External Network Data: Typical cluster designs have one or more administrative nodes (head node, login node, data storage nodes, …). You will need external Fully Qualified Domain Name (FQDN) machine names and IP addresses for all of these you plan to make visible on the network (most cluster compute nodes do not need a FQDN). Additionally, you will need the netmask, gateway, routing, DNS servers, time servers, and related settings. You should also find out where the legitimate users will be coming from, as setting up access control via firewalls is generally a good idea. Most of these items are available from your local system administrator.
- User Authentication: How will users be authenticated? Are there existing LDAP/NIS/AD or Windows domains to authenticate against? Will you use a separate authentication regime in the cluster from the outside?
- External services: Will the nodes be mounting file systems from remote servers for home directories, applications, and other purposes? Will they be contacting external license servers?
It should take all of about 5 minutes to get the network data from your friendly neighborhood IT person. The user authentication issues can be altered later on if required, so if you have to guess immediately, start assuming a local authentication regime, make it work with that, and then move to adding additional authentication if needed. External services can be added after a cluster is up.
The next step is to architect the cluster internal network. There are few reasons to provide all compute nodes direct access to a WAN or LAN. It is generally a bad idea in that it requires extra effort to administer and secure. Any benefit such a design might provide needs to be compared against the cost of implementation and maintenance of this system. We often recommend a design that looks something like this for smaller clusters (less than 128 nodes):
Table One: IP Address Assignments

| Device                    | IP Address | Notes                                                                    |
| head node                 | 10.1.0.1   | there is only one of these                                               |
| file server node          | 10.1.0.6   | (if different than head node)                                            |
| login node(s)             | 10.1.0.10  |                                                                          |
| gigabit switch 1          | 10.1.0.128 |                                                                          |
| gigabit switch 2          | 10.1.0.129 | and so on                                                                |
| compute node n in rack r  | 10.1.r.n   | where r == rack number starting at 1, n specifies the node in that rack, and n = 1 .. 253 |
As the table illustrates, a good assignment for a basic cluster of 128 nodes or fewer is the internal gateway (head node) at 10.1.0.1, the internal storage unit(s) at 10.1.0.6, and various other nodes such as login nodes at 10.1.0.10. Switches may need an address, so place them at 10.1.0.128 and up. Note that some administrators are reluctant to use a 0 in an IP address. You are welcome to increment that 0 (and everything else there) by 1, or even place the administrative bits at 254. The purpose behind a consistent naming scheme is simply support: a rack number and a node number provide a nice coordinate system for your machines. Unfortunately, computers and computer science-trained people like counting from “0”, while computer equipment such as switches often starts counting at “1”. We advise that you perform all your counting consistently, and start your first node and rack at “1” to avoid confusing people. It will work the other way, though you may discover that wiring the unit and troubleshooting later on are more complicated than they need to be. I don’t know too many switches with a port 0, though all have port 1.
Also recall that these IP addresses are non-routable (they will not work on the Internet) and designated for internal private network usage. Which, by the way, is perfect for our cluster.
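As a sanity check, the coordinate scheme can be sketched in a few lines of shell. This is illustration only; Rocks assigns compute-node addresses itself, and the exact 10.1.r.n mapping shown here is our reading of the table above.

```shell
# Illustrative only: print /etc/hosts-style entries for the
# rack/node coordinate scheme (node n in rack r -> 10.1.r.n,
# named compute-r-n). Rocks manages these assignments for you.
for rack in 1 2; do
  for node in 1 2 3; do
    echo "10.1.${rack}.${node} compute-${rack}-${node}"
  done
done
```

Reading an address or hostname then tells you exactly which rack and slot to walk to when a node misbehaves.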
This process should take about 10 minutes. For a simple, single head node, multiple compute node cluster it’s straightforward. Let’s assume we have our IP address, our netmask, our gateway, our DNS and time servers, as well as our internal architecture. Is this enough to get started?
Almost. We have to decide on a Linux distribution.
The Distribution Decision
The decision about which Linux distribution to use should be made on the basis of application needs. Pre-existing commercial applications likely have pre-set lists of supported distributions, and their documentation should be consulted for more information. Open source applications, on the other hand, can generally be run on any Linux variant; check the build tool requirements for specific packages, however.
Within Linux you have commercial and open source cluster packages available to you. Some of the commercial packages are repacked versions of the open source systems with support layers. Others are designed from the ground up to address specific needs. This article will further focus upon open source offerings. The non-open source offerings include Scyld from Penguin Computing, as well as tool sets from Platform, Scali and others. They are worth considering, as they might save you time and effort. Unfortunately, some of these require application recompilation, which may not be possible with closed source packages, so check compatibility before deploying a commercial solution.
Linux clusters are typically built from one of several Linux distributions, including Red Hat, SuSE, and Debian. Red Hat- and SuSE-based clusters enjoy wide-scale commercial application support. Generally, there are few if any issues using the CentOS Red Hat rebuild or OpenSuSE. Be aware, however, that the OFED stack (Open Fabrics Enterprise Distribution) for Infiniband is an issue for OpenSuSE, though the author has successfully worked around the issues with a few modifications to the build scripts (and installing a new kernel).
Red Hat and Red Hat-alike distributions tend to use older kernels, and include non-Red Hat-derived elements. Specifically, unlike most of the other Linux distributions, Red Hat has chosen to omit support for the XFS and JFS file systems from the base. Some vendors load Fedora, though many consider Fedora a beta test for Red Hat proper (this is changing to a degree). As such, you need to be careful when using Fedora, as it is a moving target and is often not supported by software/hardware vendors. Every choice results in compromise. Remember, clusters are engineered solutions; compromise and flexibility are required.
Currently most of the cluster systems support Red Hat and variants. Some vendors do have SuSE clustering systems. For the sake of simplicity in the first version of the cluster installation, let’s assume you have software which is supported under Red Hat. Plus, this makes our use of the Rocks cluster distribution most fortuitous.
If the nodes have a default distribution loaded, chances are that the vendor loaded its favorite (read: lowest cost) variant. If they have done this for you, great, they may have saved you some time and effort. Unfortunately, vendors don’t often configure their clusters for cluster computing; they often just load a distribution, and possibly a job scheduler. It might take you more time to work through fixing their distribution than it would to just load a new one. Test their system, and see if you can compile and run a simple MPI application easily across the cluster. If not, a reload as described here may be advised.
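One way to run that test is a trivial MPI “hello” program. The sketch below writes the source and shows the compile/run step; the `mpicc`/`mpirun` wrapper names and the machine file are assumptions that depend on which MPI stack your vendor loaded.

```shell
# Smoke test for a vendor-loaded cluster: a minimal MPI hello.
# Write the source file; each rank reports its rank and the size.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
# On a working cluster you would then compile and run it across
# nodes (wrapper names and machine file are assumptions):
#   mpicc -o hello_mpi hello_mpi.c
#   mpirun -np 4 -machinefile machines ./hello_mpi
```

If this compiles and prints one line per rank across nodes, the vendor load is probably usable; if not, a reload as described here may be the faster path.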
We Made Our Choice
Let’s assume that you have gathered this information together, are ready to load a Red Hat-based distribution, and want to start working. Since we wanted to use Rocks in any case, this was something of a foregone conclusion and requires no time. (The above discussion of the distribution options is still important, however.) As part of the distribution decision, let’s assume that you don’t want to spend lots of time installing a plain vanilla distribution and configuring it for cluster usage, which is why Rocks is a good choice.
With the Rocks 4.2.1 DVD in hand (See Sidebar), you are set for a simplified and fast cluster installation.
Finally, you need your applications. Without applications, a cluster is just a large, fast entropy generator (that is, an unrecoverable waste heat generator). Some of the Rocks Rolls include pre-installed applications.
Initial Boot-up of the Head Node
In many simplified cluster designs, a head node is your administrative, login, and scheduling node, usually with some file system service thrown in. Rocks makes these assumptions in its base configuration.
Connect a monitor, keyboard, mouse, and power to the head node. It may be helpful to have a second monitor and keyboard to connect to the compute nodes, just in case. Insert the Rocks DVD and boot the head node. (Of course you have checked the ISO MD5SUM against the web site). Rocks will start up and provide a graphical prompt. You need to type…
… to start the head node initialization. You will be presented with multiple screens asking about the “Rolls” you wish to install (functional plugins for the cluster), as well as the network information we gathered previously. Pay careful attention to the IP addresses, naming, and related settings: due to the way Rocks works, it is quite difficult to change IP addresses, names, etc. after the fact. It makes far more sense to plan out the install using the final IP addresses, naming, and networking information, and so avoid reloading the head node. (Which is why we included the planning step above!)
This process should take about 20 minutes, depending upon the hardware. At the end of the standard install, it may take a seemingly long time to finalize the installation, as it performs configuration-specific steps. It will complete, and then the system will reboot. Remove the DVD, and let the system restart normally.
Your cluster head node is now up. Elapsed time, including information gathering, has been about 35 minutes. Now it’s time to install the compute nodes.
Log into your new head node using the root password you selected. You will be prompted for an SSH key passphrase. Generally you do not need to enter one, though some schools of thought consider it a good idea.
Once in, you are ready to add compute nodes. Type…
# insert-ethers --cabinet 1 --rank 1
… at the command prompt. Select compute-nodes from the menu. Press Enter.
Turn on your first compute node. You want this node to PXE boot (network boot). If the BIOS is not configured to try PXE booting after the local hard disk, you should make that change; most BIOSes are configured to do this now. If it doesn’t PXE boot by default, and you cannot set the BIOS to do this for you, some BIOSes let you press a key at power-on to bring up a boot menu, from which you can select network boot. Use that if you need some convincing. If that fails, try a USB DVD device and boot the same DVD you used to load the head node.
Sidebar: Obtaining the Software

Before you start the 60 minute clock, you will need to pull down the Rocks cluster distribution and burn it on to a DVD. We will be using Rocks 4.2.1 (the file is a large DVD ISO, so it may take some time to download).
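Verifying the download is a one-liner. The sketch below demonstrates the checksum pattern on a small stand-in file; in practice you would run `md5sum` against the downloaded ISO (the filename will differ) and compare the hash to the one published on the Rocks download page.

```shell
# Demonstrate the checksum pattern on a stand-in file.
# For the real ISO: run "md5sum <downloaded-iso>" and compare the
# hash with the one published on the Rocks site, or save the
# published hash to a file and verify with "md5sum -c".
printf 'stand-in for the ISO' > sample.iso
md5sum sample.iso > sample.iso.md5   # record the checksum
md5sum -c sample.iso.md5             # prints "sample.iso: OK" on a match
```

A mismatch means a corrupted download; re-fetch the ISO before burning.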
If there is an existing OS on the node’s hard disk, it may boot first, and the node will not attempt PXE booting. It may be a good idea to place a monitor on the first compute node to make sure this is not happening; if it is, try adjusting the BIOS to PXE boot first. Take note of any issues with the first compute node, because the rest of them should behave in exactly the same way.
If all goes well (we are assuming that the network is in place, the Ethernet switches are powered, the compute node can send PXE packets out, and the head node can see them), then the first compute node should start installing.
We will discuss what to do if something goes wrong in a moment.
If that node starts installing, then you can do the same thing to the second, third, and so on in that first rack. Usually, when everything works as designed, if you power the nodes up in order, you will get a sequential set of nodes (i.e., the IP addresses will run sequentially from the first node started to the last). If you proceed in an orderly fashion from bottom to top of the rack, this will help resolve problems in the future. If during your build process an Ethernet switch or other device decides to broadcast a DHCP request in the middle of the process, things may get confusing.
When a mistake does happen, you will get off-by-one (or more) errors, so that the Nth node in the rack may be numbered N+1 (or worse). You can prevent this by assigning fixed addresses (remember that 10.1.0.x network for administrative machines mentioned previously?) to anything other than nodes. You may also just have started nodes in the wrong order.
Is this a problem? Perhaps it is more a cosmetic/aesthetic problem than a functional one, but in the general case where you subscribe the wrong node, there is a way to recover by removing the “extra” device. First, quit the running version of insert-ethers, and then enter…
# insert-ethers --remove the_name_of_the_wrong_compute_node
For example, if you have nodes 1 to 4 set up, and the switch suddenly broadcasts for a DHCP address, you will now have compute-1-5 being your switch. To correct this problem, quit insert-ethers and type:
# insert-ethers --remove compute-1-5
and then restart the insert-ethers from the 5th node by entering:
# insert-ethers --cabinet 1 --rank 5
By the way, rank in this case means “number of the compute node”.
When you get to the end of a rack full of machines, it is a good idea to stop insert-ethers, and restart it with a higher rack number. So if I finish out my first rack, and want to start on my second rack, I can enter:
# insert-ethers --cabinet 2 --rank 1
And so on.
Do you really need to do this? Not really; you can just turn them all on at once if you don’t care about name-to-location mappings. At some point, though, locating a specific node may become a time consuming exercise without them.
Each machine will take 4-5 minutes to install, but we can do this in pipeline fashion. Give yourself about 20 seconds between powering up machines, making sure each machine starts its PXE boot. You can get a single rack (40 systems) done in about 12 minutes this way. Coupled with the 35 minutes for head node setup and configuration, we are at 47 minutes.
Are We There Yet?
Almost done. We still need to add users and install applications.
Adding users in Rocks is simple:
# useradd name
# passwd name
To verify that it works, enter:
# su - name
You will be prompted for a passphrase for your ssh key. Again, the passphrase is optional; just don’t forget it if you enter one, as it is hard to recover.
Now, you should be able to ssh to a new compute node. For example:
[landman@minicc ~]$ ssh compute-1-1
Last login: Tue May 1 13:08:35 2007 from 10.1.0.1
And there we have it. Almost ready; we still need those applications. That is, unless you used the BIO (Computational Biology) roll from Rocks, in which case you already have thirteen popular applications ready to run.
For specific applications, we generally advise installing applications in a separate directory tree, along the lines of…
For example, for AMBER 9 built with gcc for x86_64, we would use
Yes, it looks long, and if you don’t plan on having multiple ABIs (architectures), compilers, or versions around, you can likely shorten it a bit. We have found that many users like to have multiple versions of code around for testing purposes.
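Since the original example path is elided above, here is one hypothetical layout in the same spirit, organized as prefix/application/version/compiler/ABI. Treat every path component below as an assumption, not the article's actual convention.

```shell
# Hypothetical application tree; substitute your own prefix
# (often a directory on a shared file system, e.g. /apps).
PREFIX=./apps
mkdir -p "$PREFIX/amber/9/gcc/x86_64"
# Listing the tree shows how versions, compilers, and ABIs
# stay cleanly separated:
find "$PREFIX" -type d
```

With this scheme, installing AMBER 9 built with a different compiler or for a different ABI just means adding a sibling directory rather than overwriting anything.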
You will find a job queuing system, called Sun Grid Engine, installed by default. Grid Engine documentation is available on-line. While we are talking documentation, you can find more documentation about the Rocks Rolls on the Rocks site; that documentation is also installed on your cluster. We don’t include reading documentation in our 60 minute exercise, but we do advise reading it.
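As a minimal illustration of submitting through Grid Engine, the sketch below writes a submission script that requests a few slots and runs a hypothetical MPI binary. The parallel environment name "mpi", the slot count, and the binary name are all assumptions that depend on how your Rocks/SGE install is configured.

```shell
# Write a minimal SGE submission script. Lines starting with "#$"
# are SGE directives; SGE sets $NSLOTS to the granted slot count.
cat > hello.qsub <<'EOF'
#!/bin/bash
#$ -N hello      # job name
#$ -cwd          # run in the submission directory
#$ -pe mpi 4     # request 4 slots in the "mpi" parallel environment
mpirun -np $NSLOTS ./hello_mpi
EOF
# On the head node you would then submit and monitor the job:
#   qsub hello.qsub
#   qstat
```

Output and error files (hello.o&lt;jobid&gt;, hello.e&lt;jobid&gt;) land in the submission directory thanks to the -cwd directive.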
You can run commands across the cluster using tentakel, or pdsh if you install the roll that supplies it. Running commands across the cluster is generally reserved for administrators, NOT users. All user jobs should be run through the job queuing system.
Voila. You now have an operational cluster, in less than 60 minutes. OK, we hid the latency of the DVD download and a few other things, but you get the idea: this is a simplified cluster load. It should be noted that simplified cluster loading and administration has been available via Rocks, Warewulf, Oscar, and others for a while in the Linux space. Scyld, Scali, and other tools make this quite simple from a management perspective. Anyone indicating that loading and managing a cluster in Linux is hard might not be talking from direct personal experience. Rocks makes this process quite painless.
Rocks is a simplified installation system. It is extensible, though due to its dependence on Red Hat distributions it inherits their limitations as well. The limitation most troubling to end users is new hardware support. This is not a Rocks issue but a Red Hat issue, and one which is hard to solve in Rocks without rebuilding the kernel roll to incorporate new drivers, and that assumes Red Hat drivers exist for the hardware at all, which is not always the case. Recent examples with some versions of the Intel ESB2 networking device have proved problematic. Another is XFS support, though there are RPMs and rolls available which can provide it.
Stop The Clock
The following table is our tally of time spent setting up our cluster. Not bad: three minutes to spare. Of course, your experience may take longer (or shorter!) depending on your hardware environment. The takeaway from this exercise is that installing a cluster software stack can be done quickly and easily, and need not be a bottleneck when delivering high performance to your users. Now it is your turn: ready, set, go…
Table Two: Cluster Building Time Record

| Task                        | Time       |
| Collect Network Data        | 5 minutes  |
| Design Internal Network     | 10 minutes |
| Install Head Node           | 20 minutes |
| Get a Cup of Coffee         | (while the head node installs) |
| Install Compute Nodes       | 12 minutes |
| Add Users and Applications  | 10 minutes |
| Total                       | 57 minutes |
Rocks is a trademark of the Regents of the University of California. All trademarks indicated or implied are properties of their respective owners.
The author runs Scalable Informatics, a high performance computing solutions company located in Canton, Michigan. He has been working on HPC systems since 1986, and in a vendor/user role since 1995. He worked at IBM Research, SGI/Cray, and MSC.Software before starting Scalable Informatics. He has a Ph.D. in computational physics, and has worked on algorithm acceleration and code tuning, parallelism, and lots of other things to squeeze speed out of systems and reduce time to solution and insight.