"It's a bunch of scientists that don't know enough about system administration to realize that systems need administration." -- Daniel O. Cummings on scientists with supercomputers.
The provisioning process:
- You will need to power up, manually and just once, all the service nodes if you have them; otherwise, power up all the compute nodes.
- Each node will BOOTP/DHCP boot. All xCAT nodes must be configured to always boot network first; this is how xCAT controls destiny (next boot action).
- On boot, the management node will issue a dynamic IP address to the node if the MAC is unknown; otherwise, the statically assigned IP is issued. If static, the TFTP entry in DHCP determines what happens next, e.g. if the node is not yet installed, steps 4, 9, and then 10 will be used; if it is ready to install, provisioning starts after step 10.
- The node then TFTP downloads a special xCAT discovery netboot image (xcatk).
- After the very small Linux OS boots, it contacts xcatd on the management node, hands over its MAC addresses, and says, "try to find me".
- xCAT knows the layout of all the networks and knows where to start looking. If the target machine is a BladeCenter, a virtual machine, or an HMC, there is little work to be done, since that information can be easily obtained. But for standard rack-mount machines, xCAT has to scan the switches via SNMPv3 to find the matching MAC address.
- Once a match has been made, xCAT can determine the name of the node based on how the cluster was cabled.
- xCAT, using OMAPI, updates DHCP without restarting it, then tells the node to re-DHCP an IP address.
- After the node obtains its proper address and knows its name, it contacts xcatd and asks for its next action.
- xCAT has a table that stores a chain of events. Based on a node's current state, the node can ask for its next action. A common scenario would be: discover, program the BMC, flash the BIOS, and then install the OS. Each time an event completes, the xCAT database advances a pointer to the next event, so that if a reboot is required between events (e.g. after a BIOS flash), the node resumes at the correct step. If an event fails or is interrupted, the pointer is not advanced and the node will retry indefinitely.
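In xCAT 2, this chain of events is kept in the chain table and can be inspected with tabdump. A hypothetical dump might look like the following; the node names and event values are illustrative only, and the column layout follows the xCAT 2 chain table schema:

```
# tabdump chain        (illustrative output)
#node,currstate,currchain,chain,ondiscover,comments,disable
"n1","runcmd=bmcsetup",,"runcmd=bmcsetup,install",,,
"n2","install",,"runcmd=bmcsetup,install",,,
```

Here n1 is still at the BMC-programming step, while n2 has advanced its pointer (currstate) to the OS install step of the same chain.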
If a cluster has been properly configured, all that is needed is that first manual power-up. Within minutes all the service nodes should self-discover, self-program the BMC, and self-provision.
The non-service nodes work exactly the same way, i.e. once the service nodes are up, power up the compute nodes. The process is the same, except that the service nodes provide all the services, e.g. DHCP, TFTP, etc.
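As a sketch, the whole bring-up can be driven with xCAT's rpower command; the group names service and compute are illustrative and depend on how your nodes are grouped:

```
# One-time manual power-up of the service nodes; they
# self-discover, self-program the BMC, and self-provision.
rpower service on

# Once the service nodes are ready, power up the compute nodes.
rpower compute on

# Check power state across the cluster.
rpower all stat
```

These commands assume a working xCAT management node; everything after them is the unattended discovery flow described above.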
Types of provisioning:
- Install to HD (old school). Supported by AIX (NIM), Red Hat and Red Hat-like (Kickstart), SuSE (AutoYaST), and Windows (native installer and imaging support (imagex)).
- Install to SAN (same as HD, old school for rich kids).
- Install to iSCSI. For x86 and x86-64 xCAT provides its own software iSCSI initiator based on gPXE. This allows xCAT to install any Linux or Windows that supports iSCSI without any iSCSI hardware or firmware. For PPC-based machines only Linux iSCSI is supported and gPXE is not used.
- Stateless. Linux only. Since 2005 this has been our recommended Linux cluster provisioning method. Just download the OS directly into memory and run it. No state to maintain. This is how the current Top500 #1 system operates. It can boot in about 10 minutes.
There are 3 flavors of xCAT stateless:
- RAM-root. The entire OS resides in memory. A full-featured HPC image requires about 75-160 MB of RAM (1-2% of 8GB).
- Compressed RAM-root. Same as RAM-root, but stored compressed. The image requires about 30-64 MB of RAM (<1% of 8GB).
- Hybrid RAM-root. About 5 MB of RAM; the rest of the OS is read-only NFS-mounted. Can scale 1000:1 if properly tuned.
The compressed and hybrid RAM-root options utilize a stacked file system (Aufs) and require that the OS be read-only. Copy-on-write is used for any system files that need to be changed. The changed files reside in memory and are lost on reboot (i.e. stateless); they are usually configuration files that consume only kilobytes of memory. The hybrid option has the added benefit of image updates without reboots: because the image is NFS-mounted, packages and files added to the image instantly appear on the nodes.

Stateless does not change the way applications access data or run; it is just a different way to provision the node. The OS boots and looks exactly like the Linux vendor intended it to boot. E.g. if you need to write huge amounts of data to /tmp, and you expect /tmp to be local disk, then you can still use stateless for rapid provisioning and still use the local disk as a device. Stateless does not have to be disk-free (however, that is recommended to reduce cost and power).
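A typical stateless build-and-boot cycle in xCAT 2 can be sketched as follows; the OS version, profile, interface, and driver names are illustrative and must match your environment:

```
# Build a stateless image for the chosen OS and node profile.
genimage -i eth0 -n tg3 -o centos5.3 -p compute

# Pack the image (compressed RAM-root).
packimage -o centos5.3 -p compute -a x86_64

# Point the compute nodes at the stateless image and boot them.
nodeset compute netboot
rpower compute boot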
Parallel and Distributed Tools
Figure Four: xCAT 2 Parallel and distributed tools
The xCAT parallel tools (psh, pscp, etc.) are a special type of action that does not take place through the service nodes; they have their own scaling built in, and xcatd is still used to check the ACLs. The distributed tools (dsh, xdcp, etc.), however, do run through the service nodes, and they too have their own scaling built in.
Both sets of tools were added to xCAT to maintain compatibility with the scripts of existing xCAT and CSM users. The command-line options and output of the xCAT parallel and distributed tools are consistent with the rest of the xCAT commands. Because of this consistency, xCAT has a command that can collate and group similar output, making the output of commands run against large node ranges easier to digest.
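For example, parallel output can be piped through xCAT's collating command, xcoll, which groups nodes that returned identical output under a single header. The noderange and output below are illustrative:

```
# Run a command on every node in parallel and collate the results.
psh all uname -r | xcoll

# Nodes with identical output are grouped together, roughly like:
# ====================================
# n001,n002,n003,...,n512
# ====================================
# 2.6.18-128.el5
```

On a healthy homogeneous cluster you get one group; a stray node with a different kernel shows up immediately as its own group.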
xCAT 2 has a framework for creating, managing, and provisioning virtual machines. Currently IBM System p LPARs and Linux Xen VMs are supported. Support for KVM, VMware ESX, and Microsoft Hyper-V is under development. Linux Containers (lxc) will be added when Red Hat 6 and SLES 11 are released.
For the most part, virtual machines are managed exactly like physical machines; however, xCAT has a number of VM-specific commands to aid with VM management, e.g. rmigrate to move an active VM to another host, and revacuate to migrate all VMs from a host to the hosts with the most available resources.
revacuate can also be triggered by pushing the soft power button on some of our servers: the server will evacuate all the VMs, gently shut down, and then power off. VM GUI and text consoles are also available.
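A sketch of the two VM commands in use; the VM and host names are illustrative, and the exact arguments depend on your xCAT version:

```
# Live-migrate a single active VM to another host.
rmigrate vm01 host02

# Evacuate all VMs from a host, e.g. ahead of maintenance;
# xCAT places them on the hosts with the most available resources.
revacuate host01
```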
xCAT is not just for HPC any more. Many of the features and design points in xCAT 2 were created to expand xCAT’s scope to cover all scale-out solutions. A solution of high interest is Cloud Computing.
“Cloud Computing” like “Grid” and “Cluster” may mean different things to different people. IMHO, a cloud is a set of resources networked together and utilized and managed as a single dynamic entity. Sounds like a cluster. The difference is that a cloud is dynamically provisioned on-demand based on the requirements of the data and an SLA (Service Level Agreement).
For example, in a traditional HPC cluster or grid, all nodes, or all nodes within a set, run the same operating environment, i.e. OS, libraries, mount points, etc. Through the use of file systems, users are permitted to run different applications and use different data sets if, and only if, the operating environment is compatible with their code.
This common practice works for most users. However, more datacenters are consolidating budgets into supercenters and different groups want different operating environments. Traditionally this has meant multiple clusters, and not all of them are being utilized efficiently.
A cloud-based approach would pool all the resources together and users would submit workloads with an extra parameter: operating environment. This parameter would allow more efficient use of resources because none are statically defined.
To keep various groups in check so that no one group monopolizes the cloud, a system of SLAs and allocations will need to be put in place. With advanced scheduling, a cloud can determine whether nodes will be unused for prolonged periods of time and power them off. Later, based on SLAs, nodes will be powered up and will complete provisioning with the proper environment seconds before the workload needs to start.
With advances in virtualization and containers it is possible for some workloads to run virtualized. This adds new possibilities:
- Preempt low priority workloads without killing the job, i.e., suspend the workload.
- Migrate jobs to other areas for scheduled maintenance.
- Consolidate to fewer physical hosts while maintaining SLAs, then power off the newly unused nodes to save power.
I could go on, but that is a topic for another article. So, where does xCAT fit into this? To have an effective cloud you must have scalable, robust, unattended system automation. xCAT provides this muscle. But you also need a brain: something that accepts workloads, SLAs, allocations, policies, etc., and that can control xCAT and rub out the middlemen (sorry, administrators). There are various brains to choose from. In the HPC space, Moab has been most popular with the super-sized systems. Moab can also control xCAT via its client/server protocol. At SC08 (Supercomputing 2008) we demonstrated such a setup, and we have a number of exciting joint projects in 2009. Expect some fantastic white papers.
xCAT has come a long way from a single HTML page to something that can control your world. For more, visit our project page. xCAT 2 is open source and uses an Eclipse Public License.