Caos NSA and Perceus: All-in-one Cluster Software Stack

Silence the struggle around cluster software stack configuration. Caos NSA is a distribution that focuses on making things simple, easy to install and upgrade, and easy to manage.

Perceus Configuration

  1. Select network for cluster (eth2)
  2. Define the IP range for nodes (10.1.1.2 to 10.1.1.252)
  3. What number should the first node start with (1)
  4. Perceus registration

After step 2, the installation configures Perceus. It even configures and generates ssh keys (what a nice thing to do). Perceus registration is also optional, but if you don’t input something, you can’t continue (actually this is a feature and I’ll explain why in a bit).
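As a quick sanity check on the range entered in step 2, the sketch below counts how many node addresses it provides. The helper name `range_size` is made up for illustration, and it assumes both ends of the range sit in the same /24 network, as they do here.

```shell
# Count the addresses in a range whose two ends differ only in the last octet.
range_size() {
    first_octet=${1##*.}   # strip everything up to the last dot
    last_octet=${2##*.}
    echo $(( last_octet - first_octet + 1 ))
}

range_size 10.1.1.2 10.1.1.252   # prints 251
```

So this range leaves room for 251 compute nodes before it has to be widened.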

After Perceus was installed, Sidekick asked if I wanted to check for updates. It found a few updates, all of them for the desktop. It asks if you want to install the updates or not (looks like yum to me) and then does a little housekeeping. During this housekeeping phase it erases the installation packages and does a few other things including setting up ntp for the cluster (if you run a cluster, you need to run ntp).

At this point, the master node was ready to go. I needed to grab a VNFS capsule, so I used wget to pull down a premade capsule for Caos NSA:

# mkdir CAPSULES
# cd CAPSULES
# wget -v -c http://mirror.caoslinux.org/caos-nsa-1.0/vnfs/x86_64/caos-nsa-node-0.9-30-1.stateless.x86_64.vnfs

Once the capsule is downloaded you need to “import” it so that Perceus knows about it.

# perceus import vnfs \
 /root/CAPSULES/caos-nsa-node-0.9-30-1.stateless.x86_64.vnfs

During the import process, Perceus will ask you a few quick questions. For example, it asked what root password I wanted to use for the capsule, which ethernet device was to be used for booting, and what the address was of the machine holding the capsules. After a few minutes, I was able to check whether the capsule had been imported:

# perceus vnfs list
caos-nsa-node-0.9-30-1.stateless.x86_64
#

At this point Perceus knows about the capsule I wanted to use.
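If you want to script that check rather than eyeball the list, something along these lines might do. It is only a sketch: the helper name `capsule_listed` is hypothetical, and it assumes `perceus vnfs list` prints one capsule name per line, as in the output shown above.

```shell
# Succeeds if the capsule name given as $1 appears as a whole line on stdin.
capsule_listed() {
    grep -qxF -- "$1"
}

# On a real master node the list would come from Perceus:
#   perceus vnfs list | capsule_listed caos-nsa-node-0.9-30-1.stateless.x86_64
# Here a canned list stands in for that output.
printf '%s\n' caos-nsa-node-0.9-30-1.stateless.x86_64 |
    capsule_listed caos-nsa-node-0.9-30-1.stateless.x86_64 && echo "capsule imported"
```

Matching the whole line with `-x` and `-F` avoids false hits on capsules whose names merely start the same way.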

Everything is ready to go and we can start booting compute nodes. I then booted my first compute node with a monitor and keyboard plugged into the node. I saw the node grab the Perceus OS via DHCP. It then said that no VNFS image had been defined for node n00001. This makes perfect sense since Perceus didn’t know anything about this compute node at this point, so it didn’t know which capsule I wanted it to use. I then told Perceus I wanted to use a specific capsule:

# perceus node set vnfs \
 caos-nsa-node-0.9-30-1.stateless.x86_64 n0000[1-9]

This command tells Perceus to use this particular VNFS image on nodes n00001, n00002, …, n00009. At this point, Perceus told n00001 about its VNFS image and sent it to the node. The next thing I knew, the node was up and running!
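To see which node names a range like n0000[1-9] covers, here is a small sketch. It assumes the bracket range behaves like an ordinary shell character class for this simple case, which I have not verified against the Perceus documentation; the function name `matching_nodes` is made up.

```shell
# Print the node names between n00000 and n00010 that match a glob-style pattern.
matching_nodes() {
    pattern=$1
    i=0
    while [ "$i" -le 10 ]; do
        name=$(printf 'n%05d' "$i")
        case $name in
            $pattern) printf '%s\n' "$name" ;;   # unquoted, so it is treated as a pattern
        esac
        i=$((i + 1))
    done
}

matching_nodes 'n0000[1-9]'
```

Under that assumption the pattern picks up n00001 through n00009 and skips both n00000 and n00010.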

I could easily check that the compute node was up by running

# ssh n00001

as root. If it succeeded, the node was up and ready. By default, Caos NSA configures Perceus to NFS export /usr/local, /opt, /home, /usr/cports, /srv, and /var/lib/perceus/ from the master node and mounts them on the compute nodes.
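Once there are more than a couple of nodes, the same ssh check can be put in a loop. This is a rough sketch, not something Caos NSA ships: `check_nodes` and the `SSH_CMD` override are hypothetical names, and the loop relies on the ssh keys that the Perceus installation generated so that no password prompt appears.

```shell
# SSH_CMD is a hypothetical override hook so the loop can be tried without a cluster.
# BatchMode stops ssh from prompting; ConnectTimeout keeps dead nodes from stalling the loop.
SSH_CMD=${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}

check_nodes() {
    for node in "$@"; do
        # Run a trivial command on the node; success means it booted and sshd answered.
        if $SSH_CMD "$node" true >/dev/null 2>&1; then
            echo "$node up"
        else
            echo "$node DOWN"
        fi
    done
}

# Example: check_nodes n00001 n00002 n00003
```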

The Cluster is Up – Now What?

At this point people may say, “you’ve got the cluster up but it’s not running jobs yet.” You are correct. So, let’s rectify that situation. We will need a compiler (C and Fortran), an MPI that is built to use the compilers, and a job scheduler.

Caos NSA installs a compiler suite, gcc-4.1.2, by default, as well as openmpi.

# gcc -v
gcc version 4.1.2
# gfortran -v
gcc version 4.1.2
# rpm -qa | grep -i mpi
openmpi-devel-1.2.4-3.caos.x86_64
openmpi-runtime-1.2.4-3.caos.x86_64
#

Caos NSA installs these packages by default when installing Perceus. Even better – Caos NSA installs environment modules, commonly just called modules in the cluster world, with Perceus. This column is too short to explain what modules can do for you, but if you ever want to use more than one MPI library, more than one compiler or compiler version, or more than one version of an application, then modules is what you need. It solves so many problems. Just google for “environment modules” and the first hit should be the correct website (modules.sourceforge.net).

If we run the command “module avail” we will see all of the modules that are preconfigured with Perceus.

----------------- /usr/Modules/modulefiles -----------------
dot      module-info modules       null       use.own
------------------------- /etc/modulefiles -----------------
openmpi/1.2.4

I don’t want to discuss the ins and outs of environment modules, but we can easily “load” the openmpi module with the command:

# module load openmpi/1.2.4
# module list
Currently Loaded ModuleFiles:
 1) null          2) openmpi/1.2.4
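With the module loaded, a quick smoke test of the MPI installation could look like the sketch below. It writes a tiny MPI "hello" program into a scratch directory and, if the Open MPI wrapper `mpicc` is on the PATH, builds it and runs two ranks on the local machine. The file names are made up for illustration, and this is not part of the Caos NSA installation itself.

```shell
# Scratch directory for the test program (hello_mpi.c is a made-up file name).
workdir=$(mktemp -d)

cat > "$workdir/hello_mpi.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Build and run only if the MPI compiler wrapper is available (i.e. the module is loaded).
if command -v mpicc >/dev/null 2>&1; then
    mpicc -o "$workdir/hello_mpi" "$workdir/hello_mpi.c" &&
        mpirun -np 2 "$workdir/hello_mpi" ||
        echo "build or run failed" >&2
else
    echo "mpicc not found; load the openmpi module first" >&2
fi
```

If everything is in place, each of the two ranks should print one greeting line.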

So we have an MPI library ready to go; what we need now is a job scheduler. Perceus doesn’t install one by default, but the Caos NSA team has packaged one that you can easily install called Slurm (trust me – just google it). You can use Sidekick to install it by starting up Sidekick at the command prompt, selecting “Services,” and then scrolling down and selecting “Slurm.”

Slurm asks you if it should be installed as a control service (yes) and then asks for the names of the nodes that should be used. In my case the nodes I wanted to use were home64 and n00001 (I only had two nodes at this point, including the master node, which is home64). Then the installation gives you directions on how to install Slurm in the VNFS capsule, since you will need some parts of Slurm on the compute node. The instructions are easy (write them down).

Slurm Compute Node instructions

# perceus vnfs mount [vnfs name]
# cp -ra /etc/slurm/* /mnt/[vnfs name]/etc/slurm
# perceus vnfs umount [vnfs name]
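The three commands above can be wrapped in a small script so the capsule name is typed only once. This is a sketch under assumptions: the names `run`, `push_slurm_config`, and `DRY_RUN` are hypothetical, and by default the script only prints what it would do, so it is safe to try on a machine without Perceus.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mount the capsule, copy the master's Slurm configuration into it, and unmount it.
# Each step runs only if the previous one succeeded.
push_slurm_config() {
    vnfs=$1
    run perceus vnfs mount "$vnfs" &&
    run cp -ra /etc/slurm/* "/mnt/$vnfs/etc/slurm" &&
    run perceus vnfs umount "$vnfs"
}

push_slurm_config caos-nsa-node-0.9-30-1.stateless.x86_64
```

Run it once in dry-run mode to confirm the paths look right, then rerun it with `DRY_RUN=0` as root on the master node.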

It took about 1 min. to install Slurm on the master node and about 2 mins. to update the VNFS capsule. Then I just rebooted the compute node and it had Slurm on board.

How Long Did it Take?

Since Caos NSA is really an all-in-one cluster kit, I was curious how long it took me to install it and get a cluster up and running. So I timed myself installing Caos NSA and getting Perceus going. Here are the times:

Table for Installation Times

Step                                    Time
Install Caos NSA including Perceus      10 mins.
Download capsule using wget              4 mins.
Time to import capsule                   4 mins.
Time to boot node and send capsule       1 min.
Time to install Slurm on master node     1 min.
Time to rebuild capsule with Slurm       2 mins.
Time to reboot the compute node          1 min.
Total Time                              23 mins.
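For anyone who wants to double-check the total, the per-step times can be added up with a few lines of shell arithmetic (a trivial sketch; the values are copied from the table).

```shell
# Minutes per step, in the order listed in the table.
total=0
for minutes in 10 4 4 1 1 2 1; do
    total=$((total + minutes))
done
echo "Total: $total mins."   # prints "Total: 23 mins."
```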

I don’t know about everyone else, but 23 mins. to do a complete master node installation with a firewall, cluster management installation and configuration, job scheduler installation and configuration, and getting the first node booted is the fastest I’ve ever seen! I know Joe Landman has talked about installing Rocks in 60 minutes. I will admit that I ran through the installation one time to make sure I had all of my hardware correct, but I didn’t actually configure the cluster during the first run. So 23 mins. is pretty close to the time a first-time installation will take.

Simple as Pie

So I now have a functioning Caos NSA/Perceus cluster in 23 minutes and that includes building the master node, downloading the capsules, and booting the first compute node. That’s pretty darn fast in my opinion. Plus I now have a Perceus configuration where I can boot as many compute nodes as I want as well as a job scheduler that is up and running.

I highly suggest you give Caos NSA and Perceus a try if you have a cluster you are bringing up. It’s rather easy and even, dare I say, fun. Plus the Caos NSA distribution doesn’t get in the way of things and contains the major things I need for building a stable cluster.

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).

Comments on "Caos NSA and Perceus: All-in-one Cluster Software Stack"

spichardo

After the excellent experience of using Warewulf in a previous job (as a post-doctoral fellow), I did not have any second thoughts about deploying Perceus for my new cluster, which is a diskless cluster with 16 nodes. The Caos NSA distribution was not ready yet at that time, so I stayed with Centos 5.2, with which, to be honest, I have had no problems at all; quite the contrary.

Even if I found Perceus much simpler to administer (one command to rule them all, well almost :) I was a little confused by the fact that some very basic modules (such as the ones that handle the ipaddr assignment and host names) are not activated by default. The very first time I booted the nodes, they were not accessible from the master, and after activating the required modules I had to reboot the nodes physically. Some of the scripts (most of which are invoked through the main command perceus) have some minor imperfections with the access rights for non-root users to the perceus databases. However, the dev-team is simply amazing and answers promptly through the discussion list.

The new web portal in perceus 1.5, even if it looks a little too simple since it was just introduced in this release, lets you change the VNFS capsule for a node or update the description of the node.

Anyways, Perceus simply rocks and I keep giving it praise among my colleagues. A colleague just installed it for his own beowulf cluster and is extremely satisfied with the results.

bainum

Perceus is magic. We built a cluster in the Fall 08 semester using perceus and the students had a great experience…as did I. We experimented with running both control and data networks to the compute nodes and that worked fine with some changes to the ipaddr module. Students were able to write programs using openmpi. As a system to let students build and program a cluster, perceus is excellent. We started the semester with one version and midway through the semester we did an upgrade. I want to thank Greg Kurtzer and his team for developing a cluster management system that works so well. I also commend Jeff Layton on his articles on perceus. The article in mid-2008 brought perceus to my attention and after working with it during the summer of 2008 I decided to use it in my cluster class in the fall 08 semester. We continue to run our perceus cluster with students.

kasozi5

Hi guys, I would like to get a more detailed list of features for Perceus and also the system requirements for deploying Perceus.
