"It's a bunch of scientists that don't know enough about system administration to realize that systems need administration." -- Daniel O. Cummings on scientists with supercomputers.
That quote always cracks me up. Unfortunately, it is often true. Fortunately, there are many options, and most of them are freely available and open source. This article describes one such option: xCAT, the Extreme Cluster Administration Toolkit.
xCAT: Ancient History
xCAT was created out of need. In 1999 IBM released its first 1U x86-based server to meet the demand of the dot-com explosion. At the time, the common practice for provisioning a large number of x86-based machines was to use a boot floppy to activate the NIC and start an OS install. Unfortunately, the new IBM 1U lacked a floppy drive. I was left with two options: bootable CD-ROMs or PXE-boot.
I opted for PXE since no one had the desire to create thousands of CD-ROMs. PXE booting was not new, but was uncommon for Linux network installations. Making it easy was the objective, and that is how xCAT development started.
xCAT Zero was released weeks later as an HTML document with a fistful of scripts and a few instructions on how to set up DHCP, TFTP, NFS, PXE, etc. for PXE-based network installations. A copy of this original document is still available.
From that point on xCAT grew from almost nothing to a very large, almost impossible to manage, but successful, end-to-end cluster management solution. A large part of xCAT’s success came from the following principles:
- Build upon the work of others, i.e. do not reinvent the wheel. Whenever possible use existing open source solutions and best practices. E.g. why create an installer when Kickstart and System Imager exist?
- Use scripts. No compiled code. There were two practical reasons for this. First, system administrators like scripts and like to easily change code without the ubiquitous ./configure; make install. Second, xCAT was not open source–the only way to get the source to the people was to use scripts.
- Say, Yes. Give the people what they want. This decision put a lot of bloat in xCAT, but also a lot of interesting features. E.g. Windows install, bit-for-bit HD cloning (a customer wrote his own OS; it was the only way to install), and diskless DOS HD booting for remote flashing.
Of course, there were challenges with the above principles. Not every request could be granted. xCAT was an unfunded, skunk-works-type project developed mostly on my personal time. Why use all scripts? Many of the open source dependencies used compiled code and often required patches so that we could support an ever-increasing list of OS versions. Lastly, sometimes you have to reinvent wheels; sometimes the reason is technical and other times it is political. Despite all the challenges, the community-led xCAT was very successful and used by IBM customers worldwide.
xCAT: Midlife Crisis
By 2006 feature creep had rendered xCAT development almost impossible. We had to support all versions of Red Hat, CentOS, Fedora Core, Scientific Linux, SuSE, and Windows on all supported architectures (x86, x86_64, ia64, PPC). And, the hardware kept changing. Development switched from proactive (support everything) to reactive (supporting as needed). Fortunately, xCAT’s feature set was more-or-less complete. And, we had more developers lending a hand. But, clearly xCAT was in maintenance mode.
xCAT 1: RIP
In February 2007 an IBM cross-organizational team decided that we should have only one scale-out management tool. We decided that neither xCAT nor IBM’s CSM would meet all of our current and future needs. So a new project, christened xCAT 2 and based on the best ideas of xCAT, CSM, PSSP, et al., would be created from scratch.
Our guiding principles also changed:
- Build upon the work of others. This was expanded to include ideas and not just code. Actually, we tried to ban all xCAT 1 and CSM code; this was to be a new wheel.
- Use Perl; C was now allowed. Not the friendliest of options, but we all knew Perl. Since xCAT 2 is an open source project (Eclipse Public License), compiled code is acceptable (if necessary). We found that very few administrators took advantage of changing the hodgepodge of KSH, BASH, Perl, and Python scripts in xCAT 1, so we figured that few would change the hodgepodge of Perl scripts in xCAT 2.
- Say maybe (or “sure” – a programmer’s uncommitted yes). Saying yes wasn’t always the best policy; it created an impossible-to-manage solution in the long run. Because xCAT 2 had to be the future of IBM scale-out management, we needed a solution with a road map and strong community direction. Compromises would be made for the greater good.
When xCAT 2 development started we determined that by late 2008 xCAT 2 should have enough features of xCAT 1 so that xCAT 1 could be safely retired at age 9. And there was no shame in that; xCAT 1 had a number of impressive achievements:
- It was used for the initial four TeraGrid sites.
- It witnessed the rendering of Lord of the Rings (one tool to rule them all).
- It was used for the first 1PFLOP Top500 system (xCAT 2 was used for the first 1.1PFLOP Top500 system, 5 months later).
xCAT 2: The Next Generation
xCAT is a scale-out management system with the following primary functions:
- Remote Hardware Control. E.g. power on/off/status, temperatures, voltages, watts, fan speed, etc., whatever the supported service processors can return. Presently xCAT understands how to control IPMI-based BMCs, IBM BladeCenter®, and IBM System p nodes.
- Remote Console Management. Terminal servers, IPMI SOL, BladeCenter SOL and System p HMC supported.
- Automagic Discovery and Destiny Control. xCAT can correctly gather MAC addresses in parallel for any number of nodes. Destiny is the machine’s next boot state (e.g. install, boot HD, stateless boot, etc.).
- Automated Unattended Provisioning. xCAT supports diskful (local disk), disk-elsewhere (iSCSI, and Fibre Channel SAN), and stateless (diskfree). The term diskless is not used because it is ambiguous.
xCAT also includes a number of post provisioning administration tools, e.g.:
- Parallel shell, copy and rsync.
- Monitoring Setup (e.g. Ganglia, PCP, and RMC).
- Hardware SNMP/IPMI Alert Decoding.
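The parallel shell idea above is conceptually simple: fan one command out to many nodes at once and collect the per-node output. Here is a minimal sketch of that pattern; the `run_on_node` stub stands in for a real `ssh` invocation, and the function and parameter names are illustrative, not xCAT's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_node(node, command):
    # Stand-in for an ssh invocation; a real parallel shell would
    # exec "ssh <node> <command>" here and capture its output.
    return (node, f"{node}: ran '{command}'")

def parallel_shell(nodes, command, fanout=64):
    # Fan the command out to all nodes at once, bounded by `fanout`,
    # and collect per-node output keyed by node name.
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = pool.map(lambda n: run_on_node(n, command), nodes)
    return dict(results)

out = parallel_shell([f"node{i:03d}" for i in range(1, 9)], "uptime")
```

The `fanout` bound matters at scale: launching 10,000 simultaneous ssh sessions from one management node is a good way to exhaust file descriptors, so real tools run the fan-out in bounded waves.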
And all of this must scale. xCAT 1 was designed to scale to around 10,000 nodes (the largest known xCAT 1 cluster was a Windows cluster of 20,000 nodes). xCAT 2 increases that tenfold.
xCAT 2: Architecture
The following figure provides the architecture of xCAT 2. The key components will be described below.
Figure One: xCAT 2 Architecture
New in xCAT 2 is the use of a database (DB). As you can imagine, this created an outcry among many of the more experienced cluster administrators, because nothing is more flexible than flat files. But flat files do not scale well when multiple users are reading and writing to them. As xCAT 1 grew in functionality, database features such as locking and table-join queries had to be implemented to keep moving forward. Why reinvent the database?
To appease the angry masses, xCAT 2 has an export-to-$EDITOR command and a number of CLI tools for direct script-based manipulation. The EDITOR may be vi, emacs, oocalc, or anything else that can edit a comma-delimited file. After the edits, xCAT inspects the changes and commits them to the database if there are no errors.
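That validate-then-commit round trip is the key safety property: the hand-edited CSV is checked in full before a single row touches the database. A minimal sketch of the idea, using an in-memory SQLite table with a made-up schema (xCAT's real tables and tools differ):

```python
import csv
import io
import sqlite3

# Hypothetical table schema for illustration only.
COLUMNS = ["node", "nodegroup", "mgt"]

def commit_csv(db, table, csv_text):
    # Parse the administrator-edited CSV, reject the whole edit on
    # any malformed row, and only then replace the table contents
    # inside a single transaction.
    rows = list(csv.reader(io.StringIO(csv_text)))
    if rows and rows[0] == COLUMNS:
        rows = rows[1:]                      # tolerate a header line
    for r in rows:
        if len(r) != len(COLUMNS):
            raise ValueError(f"bad row: {r}")
    with db:                                 # commits, or rolls back on error
        db.execute(f"DELETE FROM {table}")
        db.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodelist (node, nodegroup, mgt)")
commit_csv(db, "nodelist",
           "node,nodegroup,mgt\nn001,compute,ipmi\nn002,compute,ipmi\n")
```

Because the whole replacement happens in one transaction, a typo in row 900 of a 1,000-row edit leaves the original table untouched rather than half-overwritten.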
The advantage of a database is xCAT can scale, safely, with little concern about corruption and deadlocks. This feature becomes more important with adaptive or autonomous management – things just need to work.
Currently PostgreSQL and SQLite have been tested and are supported, but any Perl-DBI database should work. A database with network client/server support (e.g. PostgreSQL) is required in scale-out management environments where service nodes are used to scale up xCAT services. However, just because xCAT uses a database does not mean that applications can or should be written to interface directly with xCAT data; a direct consumer could, for example, misinterpret the regular expressions stored in the database.
Regular expressions can be used and stored in the database to aid the administrator with the dull task of defining a cluster. This feature can save hours and reduce mistakes. All xCAT code understands this and uses an API (xcatdb) to properly read and write data.
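To see why one stored pattern beats thousands of hand-typed rows, consider deriving a node's IP address from its name. The sketch below shows the idea only; xCAT's actual stored-regex syntax and table layout are its own, and the naming scheme here is invented:

```python
import re

# One rule covers every node: a pattern to match the node name plus
# a function that computes the attribute (here, an IP address) from
# the captured number. The rule format is a simplification of what
# xCAT actually stores in its tables.
RULE = (r"node(\d+)",
        lambda m: f"10.0.{int(m.group(1)) // 256}.{int(m.group(1)) % 256}")

def node_ip(name):
    pattern, make_ip = RULE
    m = re.fullmatch(pattern, name)
    if m is None:
        raise KeyError(name)
    return make_ip(m)
```

With a rule like this, adding a thousand nodes to the cluster definition costs nothing beyond naming them consistently, and there are no per-row typos to hunt down.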
Figure Two: xCAT 2 Database design
(Any resemblance to the Starship Enterprise is just a coincidence.) As illustrated above in Figure Two, DBI is the only access to the database, and xcatdb is the only access to DBI. The safest way to interact with xCAT and xCAT data is via the xCAT client/server interface. This design will properly represent all cluster data. NOTE: Triggers have not been implemented yet.
The heart of xCAT 2 is xcatd. All administrator requests enter and exit xcatd via an XML/SSL interface. The plug-ins define this relatively simple protocol at startup, i.e. when xcatd starts, all plug-ins are loaded and their contents define the vocabulary available to the clients.
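Conceptually, each request on the wire is a small XML document naming a command and its target nodes. The sketch below shows the shape of such an exchange; the element names are invented for illustration, the real xcatd schema differs, and the transport would of course be an SSL socket rather than a string:

```python
import xml.etree.ElementTree as ET

def make_request(command, noderange):
    # Build a request document: which command to run, on which nodes.
    req = ET.Element("xcatrequest")
    ET.SubElement(req, "command").text = command
    ET.SubElement(req, "noderange").text = noderange
    return ET.tostring(req, encoding="unicode")

def parse_request(xml_text):
    # The daemon side: recover the command and its target noderange.
    req = ET.fromstring(xml_text)
    return req.findtext("command"), req.findtext("noderange")

wire = make_request("rpower", "n001-n100")
```

Keeping the protocol this simple is what lets clients be written in any language with an XML library and an SSL stack, which is exactly the variety of clients described next.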
Clients can exist anywhere on the network and do not need to be on the management node. The CLI clients are written in Perl (client.pm). The web-based GUI clients are written in PHP (client.php). Java GUI clients use client.java. Python, C, and PowerShell client libraries will eventually be developed. Currently the CLI client code is packaged as Linux RPMs, but other package formats will be created for OS X, Debian-based distributions, and Windows. An iPhone client is under consideration as well.
All clients must have a set of SSL keys created by the xCAT administrator before any attempt can be made to communicate with xcatd. Once an SSL session is established, an action can only take place if the user has the proper ACLs set up. Unlike xCAT 1, xCAT 2 does not require (or recommend) that root administer the cluster. The practice of no user accounts on the management node and no root administration (yes, you can disallow root access to xcatd) would be ideal, but this is often not the case.
ACLs can be defined per user and can limit what actions that user can execute and when they can execute them. E.g. you may permit all users access to vital data (temperatures, watts, etc.), junior administrators the ability to power machines on (but not off), the Moab scheduler the ability to provision on demand (Moab uses client.pm), and senior administrators the ability to do anything. However, before any actions can take place, xcatd must set up a number of services used to discover and provision the cluster, e.g. DHCP, NFS, DNS, TFTP, etc.
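The per-user policy described above boils down to a lookup: which actions is this identity allowed to invoke? A minimal sketch, with invented user names and action names (xCAT's real ACLs are richer and can also restrict *when* an action may run):

```python
# Each user maps to the set of actions they may invoke; "*" grants
# everything. The users and action names here are illustrative only.
ACLS = {
    "junior": {"rvitals", "rpower.on"},           # may power on, never off
    "moab":   {"nodeset", "rpower.on", "rpower.off"},  # scheduler account
    "senior": {"*"},                               # full control
}

def allowed(user, action):
    # Unknown users get an empty ACL, i.e. they may do nothing.
    acl = ACLS.get(user, set())
    return "*" in acl or action in acl
```

The important design point is that the check happens inside the daemon, after SSL authentication, so no client-side tool can talk its way around it.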
There is a special type of xcatd instance that runs on service nodes. Service nodes are helpers; they provide the same services as the management node to aid with scaling. Service nodes are stateless Linux instances (no local disk required) created by the xCAT administrator when xCAT is installed and set up. The more machines you plan to monitor or provision in parallel, the more service nodes you will require. E.g. the current Top500 system has 18 service nodes and can provision all ~10,000 nodes in parallel.
When a service node boots, it checks in with xcatd on the management node. Once xcatd determines that this is truly a trusted service node, it hands over the keys so that both xcatd instances can have bidirectional communication without the need for ACLs. After this trust relationship has been established, the management node can call all the shots and distribute actions to the service nodes.
Service nodes can be setup the following ways:
- Static with static node assignment. This is most common and the easiest to setup and understand. But it can be a single point of management failure.
- Pooled static with static node assignment. A pool of service nodes can provide HA services for discovery and provisioning; however, xcatd commands delegated to service nodes will follow a static assignment. In the event of a pooled node failure, the xCAT administrator can simply change the assignments. This can also be automated with HA software.
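The "change the assignments" step in the pooled case is just rewriting a node-to-service-node map. A sketch of that failover, with invented node and service-node names (real HA software would also detect the failure and trigger this automatically):

```python
# Static assignment: each compute node is pinned to one service node.
assignment = {f"n{i:03d}": ("sn1" if i <= 50 else "sn2")
              for i in range(1, 101)}

def fail_over(assignment, dead, pool):
    # Reassign every node served by the dead service node round-robin
    # across the surviving members of the pool.
    survivors = [sn for sn in pool if sn != dead]
    orphans = sorted(n for n, sn in assignment.items() if sn == dead)
    for j, node in enumerate(orphans):
        assignment[node] = survivors[j % len(survivors)]

fail_over(assignment, "sn1", ["sn1", "sn2", "sn3"])
```

Spreading the orphaned nodes round-robin keeps any one survivor from suddenly carrying the whole failed node's load.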
Actions do things, e.g. power on a node, install an OS, flash a BIOS, etc., and are defined by the plug-ins created for xcatd. If you have service nodes, xcatd will determine, based on how you defined your cluster, which actions get farmed out to which service nodes.
Actions return status, output, and errors to whichever xcatd initiated the action. E.g., if a service node initiated the action, then that status/output/error is forwarded from the service node's xcatd to the management node's xcatd and then finally to the client.
Actions are fairly straightforward. If you have ever built a cluster by hand, you did all the actions. xcatd simply automates similar actions in a consistent, scalable way. And all actions, and who is responsible for them, are logged to syslog. If you want to expand xCAT’s set of actions, you can simply create your own plug-in. xCAT provides a “Hello World” plug-in as a place to start.
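The plug-in mechanism amounts to this: a plug-in declares which commands it handles, and the daemon routes matching requests to it while logging who asked. A "Hello World"-style sketch of that routing; the class and function names are illustrative, and xCAT's real plug-in API is Perl-based:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("xcatd-sketch")

class HelloPlugin:
    # The commands this plug-in contributes to the daemon's vocabulary.
    commands = {"hello"}

    def handle(self, command, args, user):
        # Real actions would power nodes, flash firmware, etc.;
        # this one just answers.
        return f"Hello, {args}!"

PLUGINS = [HelloPlugin()]

def dispatch(command, args, user):
    # Route the request to the first plug-in claiming the command,
    # logging the action and the responsible user along the way.
    for plugin in PLUGINS:
        if command in plugin.commands:
            log.info("user=%s command=%s args=%s", user, command, args)
            return plugin.handle(command, args, user)
    return f"error: unknown command '{command}'"

reply = dispatch("hello", "world", user="alice")
```

Dropping a new plug-in into the list is all it takes to extend the vocabulary; no dispatcher code changes, which is what makes the "create your own plug-in" path cheap.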
Figure Three: xCAT 2 Provisioning scheme
Before provisioning a cluster with xCAT, the cluster must be defined to xCAT. xCAT must know every IP address, network, route, node name, where nodes will get services from, what OS to install, how to install it, what to install to, etc. xCAT must also know the system’s physical network layout, all the way down to which node is connected to which switch on which port. And, if you want xCAT to report where problems are, then you will also need some type of coordinates (e.g. room, row, rack, U#).
This may seem like a daunting task, but most systems have some type of order. And if you have order, then regular expressions can be used to automate much of the data entry. Once a system has been defined, you are ready to provision.