
Highly-Affordable High Availability

High-availability clusters can provide a big reliability bang for your budget bucks. High-availability guru Alan Robertson shows you how.

If you’re a system administrator, you’ve already had it happen: you’ve just ordered lunch when your pager goes off. No lunch for you today. Or maybe you’re on the other side of the fence: the server is down, and your system administrator can’t be found. You miss your deadline because no one’s available to fix your critical system.

High-availability (HA) clusters can dramatically cut downtime, and since service failovers are fast and automatic, system administrators get to finish their lunch and users get to finish their work. “Admins” are happy, users are happy, even pointy-haired managers are happy, because minimizing work stoppages saves money.

Although high-availability means different things to different people, here it refers to highly-available clusters. An HA cluster is a set of servers that work together to provide a set of services. In an HA cluster, services don’t belong to any one server in the cluster, but to the cluster as a whole. If one server fails, its services are provided quickly and automatically by another server.

While HA systems can’t eliminate outages completely, they can make hiccups very, very short. And when they’re short enough, they can go unnoticed or will get blamed on something else, like a “glitch” in the Internet. When working as it should, an HA system is like an illusionist’s trick, where the hand is faster than the eye. Indeed, an HA cluster that’s properly designed, configured, installed, and managed should add a “9” to your availability, cutting your downtime by 90%. (See the sidebar “The Magic of Nines” to understand how availability is commonly measured.)




The Magic of Nines

The availability of a service is commonly measured by how many “9’s of availability” it provides. If a server is up 90 percent of the time, it has one 9 of availability. If it’s up 99 percent of the time, it has two 9’s of availability, and so on. If you translate these “number of nines” into how much downtime is allowed per year, you get something like this:

No. of Nines    Availability    Downtime/Year
1               90.0000%        37 days
2               99.0000%        3.7 days
3               99.9000%        8.8 hours
4               99.9900%        53 minutes
5               99.9990%        5.3 minutes
6               99.9999%        32 seconds
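
To see where these figures come from, here is a minimal shell sketch of the arithmetic, assuming a 365-day year. The function name downtime_minutes is made up for this illustration.

```shell
# downtime_minutes NINES
# Print the downtime allowed per year, in minutes, for a given number
# of nines. Assumes a 365-day year (525,600 minutes).
downtime_minutes() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.1f\n", 365 * 24 * 60 / (10 ^ n) }'
}

downtime_minutes 3    # prints 525.6 (about 8.8 hours)
downtime_minutes 5    # prints 5.3
```

Each additional nine divides the allowed downtime by ten, which is why adding a “9” to your availability cuts downtime by 90%.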

Even if you start with an unreliable operating system, add unreliable software, and put it on flaky hardware, a good HA cluster package can still make things better. You may get up to three nines if you’re lucky. But, if you start with enterprise-class hardware with good maintenance features, add a stable Linux kernel, put on rock-solid applications, mix in some good administrative training and procedures, you can look forward to much better results, perhaps five nines or more.

A Real-life HA Server

Let’s see how to configure and deploy a real HA server. The sample cluster [based on the author's personal development cluster] provides four HA services: NFS, Samba, DHCP, and Postfix (a mail relay), and is based on two x86 servers, connected as shown in Figure One. The system can be used to simultaneously develop software, write documents, and store email. It can also be used to serve music files. And since it’s an HA system, even if one of the servers crashes or is down for maintenance, the music just keeps on playing. HA email and a bulletproof jukebox — what more could you ask for?








Figure One: HA cluster physical view

Each server pictured in Figure One is an x86 system running SuSE Linux Enterprise Server 8 (SLES8) with one IDE boot/root drive, and one 80 GB drive for /home. SLES8 was chosen because it comes prepackaged with reasonable versions of all the needed software.

The Heartbeat package is used to detect failures and manage cluster resources. The DRBD package (described briefly here and much more extensively in the feature beginning on page 30) keeps the two copies of /home (one on each server) continually in synch. DRBD can be thought of as RAID1 (mirroring) over a LAN. Each machine is connected to a LAN by a 100 megabit connection, and the two machines are interconnected with a dedicated 100 megabit link for DRBD filesystem synchronization and a serial link for sending heartbeats. Each machine has its own UPS for power protection.

This is a minimal configuration for a high-availability server with shared data. For higher-throughput (write-rate) systems with fast disks, the dedicated link should be a gigabit link. If you use gigabit NICs, they only add a small amount to the cost, and the total cost of putting together such a system remains very low. Exactly how low the system price is depends on what kind of server hardware you start with.








Figure Two: HA cluster service view

Another way to view the system is to see how the various components interact within an active server. Figure Two illustrates that view.

Architecting Your HA Configuration

High-availability clustering is designed to protect your system against failures. So, as you design your own HA system, it’s important to look for single points of failure (SPOFs) in your design. If there’s a single item whose failure causes the whole cluster to fail, that’s a SPOF. The cure for most SPOFs is redundancy. In fact, the “three R’s of high availability” systems are Redundancy, Redundancy, and Redundancy. If that sounds redundant, then maybe that’s appropriate.

As you look at the system architecture for the sample cluster, you’ll see redundant servers, redundant uninterruptible power supplies, redundant disks, and so on. These redundancies are what allow HA clustering to work effectively.

This architecture has no internal SPOFs. No matter what fails in the cluster, everything can be recovered. Although the loss of the replication link will stop the data from being replicated to the slave disk, it won’t cause system failure, so it isn’t a SPOF. (Although we’ve configured a replication cluster here, shared disks are also commonly used. For a discussion of shared disks versus replicated data, see the sidebar “Shared Disk versus Disk Replication.”) Service can even survive destruction of the primary system by fire.




Shared Disk versus Disk Replication

DRBD replicates data between two disks of any kind and provides very inexpensive storage with no single points of failure. However, it also doubles the storage requirements and incurs some occasionally lengthy resynchronization intervals after crashes. It can also slow down disk writes in some applications.

For many higher-end applications, these disadvantages are troublesome. For those applications, people often use shared disk arrangements instead. These can be multi-attach SCSI RAID boxes, dual controller RAID arrangements (like IBM’s ServeRAID), shared fiber-channel disks, or high-end storage like IBM’s Enterprise Storage Server, or the various high-end EMC solutions. These systems are relatively costly (ranging from $5K USD to millions of dollars). However, they don’t suffer from the latency increases or the more frequent, full resynchs.

Of course, only the most expensive of these solutions avoid internal single points of failure.

Under The Covers: How HA Clustering Works








Figure Three: Normal HA configuration

HA clustering software monitors the servers in the cluster — typically using a heartbeat mechanism that acts a bit like the Linux init system for the cluster as a whole. That is, the heartbeat starts and stops services so they are always running somewhere in the cluster. One of the most popular HA packages, and the one used in the sample cluster, is called Heartbeat.

Heartbeat uses scripts very similar to standard init scripts to start and stop services. Heartbeat manages resources by groups, and a group of resources always runs on the same machine in the cluster. In addition to normal service scripts (like nfsserver and dhcpd), Heartbeat also manages individual IP addresses as resources through the IPaddr resource script. Resource groups are configured in the /etc/ha.d/haresources configuration file, as explained below.

As mentioned earlier, DRBD is a disk replication package that makes sure every block written on the primary disk gets copied to the secondary disk. From DRBD’s perspective, it simply mirrors data from one machine to another, and switches which machine is primary on command. From Heartbeat’s perspective, DRBD is just another resource (called datadisk) that Heartbeat directs to start or stop (become primary or secondary) as needed.








Figure Four: Failed over HA configuration

For a cluster providing its services through one IP address, you need three semi-public IP addresses: one for each machine for administrative purposes, and one to talk to the services in the resource group. In the sample cluster, the address 10.10.10.20 is the service address. That is, whenever anyone wants NFS, Samba, or Postfix services, they connect to 10.10.10.20. Heartbeat makes that IP address available on whatever machine is running the resource group.

In the normal configuration, as shown in Figure Three, paul provides the services and owns the homeserver IP address at 10.10.10.20. If paul fails, silas takes over the homeserver virtual IP address and the corresponding services. If clients try to contact homeserver when paul is down, they reach silas. This situation is illustrated in Figure Four. Now that you know how it all works, here’s how to build it.

Prepare the Hardware

There are four cluster-specific things to connect: the disks, the crossover NICs, the crossover serial cable, and the UPS control cables.

* First, install the disks according to usual Linux procedures (see September 2001’s “Guru Guidance” column, available online at http://www.linux-mag.com/2001-09/guru_01.html), but don’t create any filesystems on them.

* Next, install the NICs, and configure both crossover NICs with private addresses on the same subnet, chosen from the 192.168.0.0/16 or the 10.0.0.0/8 range.

* Then, acquire a serial cable intended for PC-to-PC communication. Be sure that the cable includes null modems, and includes the CTS and RTS leads.

* Finally, connect each computer to its own UPS.
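
As a quick sanity check on the crossover addresses, a sketch like the following tests whether an address falls in one of the two private ranges named above. The function name is hypothetical, and the match is a simple prefix test rather than a full address parser.

```shell
# in_named_private_range ADDRESS
# Succeed if ADDRESS starts with 10. or 192.168., the two ranges
# mentioned in the text. This is a prefix match only; it does not
# validate that ADDRESS is a well-formed IPv4 address.
in_named_private_range() {
    case $1 in
        10.*|192.168.*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_named_private_range 192.168.1.1; then
    echo "192.168.1.1 is suitable for the crossover link"
fi
```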

Although these directions are somewhat x86-specific, all the software runs on all Linux platforms, so you’re not restricted to a specific form of hardware.

Install the Software

For this cluster, there are several packages to install. You need: heartbeat-1.0.3, heartbeat-pils-1.0.3, heartbeat-stonith-1.0.3, and drbd-0.6.3. Each is available for SLES8 — just grab the latest versions from SuSE. If you’re not running SLES8, you can get the packages from http://linux-ha.org.

Install the packages using rpm or yast2 or your favorite method. Of course, you’ll also need to install whatever services you want to support. For the example, that’s nfs-utils, samba, dhcp-base, dhcp-server, dhcp-tools, and postfix.

Configure DRBD

DRBD is configured through the file /etc/drbd.conf. The file has some global parameters and some local parameters. (The drbd.conf file for the example system is shown in “Configuring DRBD.”) Make sure to set the disk sizes correctly.




Configuring DRBD

Here’s the content of /etc/drbd.conf for the sample configuration.


resource drbd0 {
  protocol=C
  fsckcmd=/bin/true

  disk {
    disk-size=80418208
    do-panic
  }
  net {
    sync-rate=8M # bytes/sec
    timeout=60
    connect-int=10
    ping-int=10
  }
  on paul {
    device=/dev/nb0
    disk=/dev/hdc1
    address=192.168.1.1
    port=7789
  }
  on silas {
    device=/dev/nb0
    disk=/dev/hdc1
    address=192.168.1.2
    port=7789
  }
}

To compute your disk size, run blockdev --getsize on the partition on each machine and divide the result by 2. If the two sides give different results, use the smaller value.
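
The calculation can be sketched in shell as follows. The function name drbd_disk_size is invented for this example, and the sector counts shown are illustrative values rather than measurements from the sample cluster.

```shell
# drbd_disk_size SECTORS_A SECTORS_B
# Given the "blockdev --getsize" output from each node (a count of
# 512-byte sectors), halve each value and print the smaller result.
drbd_disk_size() {
    a=$(( $1 / 2 ))
    b=$(( $2 / 2 ))
    if [ "$a" -lt "$b" ]; then
        echo "$a"
    else
        echo "$b"
    fi
}

# Illustrative sector counts only; substitute the numbers reported
# by blockdev --getsize /dev/hdc1 on paul and on silas.
drbd_disk_size 160836416 160836480    # prints 80418208
```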

Next, make a filesystem on paul. It’s important that you use one of the journaling filesystems for the filesystem type, and for this example, that you make the partitions exactly the same size.

This means you need to choose one of Reiserfs, Ext3, JFS, or XFS. And, because we’re using DRBD, it’s safer to make the filesystem on the /dev/nb0 device rather than the underlying device.

Here are the commands to run on paul:


# /etc/init.d/drbd start

When prompted to make paul primary, say “Yes.” Next, you need to make the filesystem and mount it.


# mkfs -t reiserfs /dev/nb0
# datadisk /dev/nb0 start

Finally, if you’re using a gigabit Ethernet connection for synchronization, change the sync-rate parameter, which limits the maximum speed for resynchronizations.

Configure Heartbeat

Heartbeat has three configuration files: ha.cf configures basic cluster information; haresources configures the init-like resource groups; and authkeys configures network authentication. Sample versions of these files can be found in /usr/share/doc/packages/heartbeat, and are documented in Heartbeat’s “Getting Started” document. These three files need to exist on both machines in the cluster.

ha.cf provides Heartbeat with basic configuration information. It configures the nodes in the cluster, how things should be logged, where to send heartbeats, and parameters concerning the heartbeat interval and dead time interval. This is the /etc/ha.d/ha.cf file for our sample cluster:


logfacility local7            # syslog facility
keepalive 1                   # HB interval
warntime 2                    # late HB
deadtime 10                   # failover time
nice_failback on
node paul silas
ping 10.10.10.254             # router addr
bcast eth0 eth1               # HB bcast intf.
serial /dev/ttyS0             # HB serial link
respawn /usr/lib/heartbeat/ipfail
stonith_host paul apcsmart silas /dev/ttyS1
stonith_host silas apcsmart paul /dev/ttyS1

In the example file above, heartbeats are sent across eth0, eth1, and /dev/ttyS0. For our example (and most clusters), this file is identical across all the nodes. And as noted in the earlier pictures, the power supplies are configured as stonith devices, which are discussed in the “STONITH” sidebar.
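
Because the timing directives interact, it can help to check them before starting Heartbeat. The sketch below reads keepalive, warntime, and deadtime from an ha.cf-style file and verifies that they increase in that order. Both function names are hypothetical, the ordering rule is a rule of thumb assumed here rather than something Heartbeat enforces, and the check handles only whole-second values.

```shell
# ha_cf_value FILE KEY
# Print the first value given for directive KEY, ignoring comments.
ha_cf_value() {
    awk -v key="$2" '{ sub(/#.*/, "") } $1 == key { print $2; exit }' "$1"
}

# check_ha_timing FILE
# Succeed only if keepalive < warntime < deadtime (integer seconds).
# A missing directive makes the comparison fail.
check_ha_timing() {
    keepalive=$(ha_cf_value "$1" keepalive)
    warntime=$(ha_cf_value "$1" warntime)
    deadtime=$(ha_cf_value "$1" deadtime)
    [ "$keepalive" -lt "$warntime" ] 2>/dev/null &&
        [ "$warntime" -lt "$deadtime" ] 2>/dev/null
}
```

With the sample file, check_ha_timing /etc/ha.d/ha.cf would succeed, since 1 is less than 2 and 2 is less than 10.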




STONITH

STONITH is an acronym for “Shoot The Other Node In The Head.” It’s a technique that Heartbeat uses to ensure that a supposedly dead server doesn’t interfere with current cluster operation, and more specifically, that it doesn’t damage any shared disks.

If you have shared disks, then STONITH is mandatory. Otherwise, some kind of misconfiguration or software bug might cause each server to think the other side is dead. This is called a split-brain condition. If they both mount a shared disk simultaneously, then the data on it is destroyed. This is generally thought to be a bad thing.

There are some types of disk sharing arrangements like IBM’s ServeRAID where the hardware guarantees that no more than one computer can access the disk at a time, so they don’t need STONITH.

If you’re using DRBD, the consequences of split-brain are a little less severe, and for some applications you may be able to ignore them. When using DRBD, a split-brain will cause both sides to become primary and modify their copies of the data separately. Unfortunately, when the two systems come to their senses, you will have to throw away the updates on one of the two systems. If you can live with throwing away good updates during the rare split-brain condition, that cluster can get by without STONITH. If you cannot live with this, then you must configure a STONITH device.

To find out what kinds of STONITH devices Heartbeat currently supports, issue this command:


# /usr/sbin/stonith -L

To get the complete list of information on all these devices and how to configure them, issue this command:


# /usr/sbin/stonith -h

Here’s the /etc/ha.d/haresources file:


paul 10.10.10.20 \
datadisk::drbd0 \
nfslock nfsserver nmb smb \
dhcpd postfix

This file creates a single resource group, nominally belonging to paul, containing the IP alias 10.10.10.20, the datadisk (DRBD) resource for drbd0, and the NFS, Samba, dhcpd, and Postfix resources. Heartbeat uses the :: notation to separate arguments to the init scripts. (This is the primary difference between Heartbeat scripts and normal system init scripts.)
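
To make the :: notation concrete, the following sketch shows how a haresources entry maps onto a script name, its arguments, and an action. It is an illustration only, not Heartbeat’s own code, and the function name is invented. It relies on the haresources convention that a bare IP address is shorthand for the IPaddr resource script.

```shell
# resource_to_command RESOURCE ACTION
# Print the script invocation that a haresources entry stands for.
# A bare IP address is treated as shorthand for IPaddr::ADDRESS.
resource_to_command() {
    res=$1
    action=$2
    case $res in
        [0-9]*.[0-9]*.[0-9]*.[0-9]*) res="IPaddr::$res" ;;
    esac
    # Each :: separates the script name from its arguments.
    printf '%s %s\n' "$(printf '%s' "$res" | sed 's/::/ /g')" "$action"
}

resource_to_command 10.10.10.20 start       # prints: IPaddr 10.10.10.20 start
resource_to_command datadisk::drbd0 start   # prints: datadisk drbd0 start
resource_to_command nfsserver stop          # prints: nfsserver stop
```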

To clarify where all these scripts are located, IPaddr and datadisk are located in /etc/ha.d/resource.d/. The other scripts are found in /etc/init.d/, the place normal init scripts are found.

Heartbeat is happy to manage most services that come with init scripts, without any extra work. However, the script names must be identical on all servers in the cluster. (Script names tend to differ between distributions, so using a single distribution across all servers tends to make configuration and maintenance easier.)

Here’s the /etc/ha.d/authkeys file:


auth 1
1 sha1 RandomPasswordfc970c94efb

authkeys is the simplest of the configuration files. It contains the authentication method (sha1), and a key to use when signing packets. This file must be identical on all servers in the cluster, and may not be readable or writable by any user other than root.
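
One way to create such a file with a random key and root-only permissions is sketched below. The function name make_authkeys is hypothetical, and drawing 16 bytes from /dev/urandom is a choice made for this example, not a Heartbeat requirement.

```shell
# make_authkeys PATH
# Write an authkeys file at PATH with a random 32-hex-digit sha1 key,
# readable and writable only by its owner.
make_authkeys() {
    key=$(dd if=/dev/urandom bs=16 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    (
        umask 077    # make sure the file is never created world-readable
        printf 'auth 1\n1 sha1 %s\n' "$key" > "$1"
    ) && chmod 600 "$1"
}
```

Run it as root on one node, for example make_authkeys /etc/ha.d/authkeys, then copy the resulting file to the other node so that both machines share the same key.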

Configure Services

Services cannot be simultaneously controlled by both Heartbeat and init, so disable the nfslock, nfsserver, nmb, smb, dhcpd, and postfix services from starting at boot time. Do that by issuing the following command:


# chkconfig --del nfslock nfsserver \
nmb smb dhcpd postfix

Also make sure that the /home partition is not already mounted automatically from /etc/fstab. If there’s an entry for /home in fstab, remove it. Next, add an entry like this one:


/dev/nb0 /home reiserfs noauto 0 0

If /home is currently mounted, unmount it.

In most applications, it’s necessary to have a name to go with the service IP address. If you use /etc/hosts for your network, you’ll need to add a line like this to your /etc/hosts file:


10.10.10.20 homeserver # HA services

If you use DNS, update your DNS servers accordingly. Then clients can add a line like this to /etc/fstab:


homeserver:/home /home nfs defaults 0 0

For some services, it’s necessary to move their state data to the replicated disk. It’s also convenient to move as many HA service configuration files to the shared disk as possible. That way, one copy of these configuration files exists, and you can’t accidentally forget to update one of the copies on the cluster.

Next, create a directory called /home/HA-config/. This will mirror portions of the /etc/ and /var/ directory structures. Then move the following files and directories to /home/HA-config/etc/: /etc/postfix/, /etc/samba/, /etc/exports, and /etc/dhcpd.conf, and replace them in the real /etc/ directory with symlinks that point to the pathnames on /home/HA-config/.

Next, do the same thing for the following directories in /var: /var/lib/dhcp/, /var/lib/nfs/, /var/lib/samba/, /var/spool/mail/, and /var/spool/postfix/. The idea of this is that when applications use these files, they will get the files off the replicated /home directory instead of the local root disk.
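
The move-and-symlink step is the same for every file and directory, so it can be wrapped in a small helper. This is a sketch under the assumption that /home is mounted on the node where you run it; the function name relocate is invented, and paths should be given without a trailing slash.

```shell
# relocate SRC DESTROOT
# Move SRC (an absolute path) to the same path underneath DESTROOT,
# then leave a symlink at the original location pointing to the copy.
relocate() {
    src=$1
    destroot=$2
    dest="$destroot$src"
    mkdir -p "$(dirname "$dest")" &&
        mv "$src" "$dest" &&
        ln -s "$dest" "$src"
}
```

For example, relocate /etc/exports /home/HA-config would move the file to /home/HA-config/etc/exports and leave /etc/exports behind as a symlink.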

Next, unmount /home like this:


# datadisk /dev/nb0 stop
# /etc/init.d/drbd stop

Services often need to be told what IP address you want them to use. In the case of the nfslock service, the /sbin/rpc.statd program needs to be told the address to advertise NFS locks on by adding the -n homeserver option to the invocation of rpc.statd found in /etc/init.d/nfslock. For Samba, add an interfaces option to the [global] section of /etc/samba/smb.conf:


interfaces = 127.0.0.1/8 10.10.10.20/24

Next, tell Postfix to accept mail on the service address as well as on the loopback address by adding this line to /etc/postfix/main.cf:


inet_interfaces = 127.0.0.1, 10.10.10.20

Testing DRBD

No matter how much you think you know about these services, their configuration, DRBD, or Heartbeat, you must test your HA system. The more thoroughly you test it, the more you’ll know about how things work, and the more confidence you’ll have in the result. An HA system that isn’t well tested won’t be highly-available. (Good HA testing could be an article in itself.)

When you use DRBD, you’re trusting it to replicate data exactly. It is as vital as the disks and the filesystem code. For now, disable DRBD and Heartbeat from automatically starting with the commands:


# chkconfig --set drbd off
# chkconfig --set heartbeat off

Remember to run the commands on both machines. Now, reboot both servers. On silas, issue this command:


# /etc/init.d/drbd start

On paul, enter:


# /etc/init.d/drbd start

You should see that silas’s console has now continued. You can verify that DRBD has made paul primary and silas secondary by running this command on paul:


# cat /proc/drbd

You should see something like this…


0: cs:SyncingAll st:Primary/Secondary

… which indicates that everything’s been started correctly, and that a full synch is underway. If you are using a 100 megabit link and large disks, this resynchronization takes a while. You can check progress in /proc/drbd.
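
For a rough idea of how long “a while” is, you can divide the disk size by the configured sync-rate. The sketch below assumes that the disk-size value from drbd.conf is in kilobytes, that sync-rate is a sustained rate in megabytes (2^20 bytes) per second, and that nothing else limits throughput. The function name is made up for the example.

```shell
# resync_minutes DISK_SIZE_KB RATE_MB_PER_SEC
# Estimate the duration of a full resynchronization, in whole minutes.
resync_minutes() {
    LC_ALL=C awk -v kb="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", kb / 1024 / rate / 60 }'
}

# Sample cluster: disk-size=80418208, sync-rate=8M
resync_minutes 80418208 8    # prints 164
```

Under these assumptions, a full synch of the sample 80 GB disk takes somewhat under three hours, which is one reason to raise sync-rate when a gigabit link is available.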

To test DRBD, follow the instructions in the DRBD article in this issue, or wait for /proc/drbd to indicate that the full synch is done.

Testing Heartbeat

Next, issue an /etc/init.d/heartbeat start on paul. This starts up the Heartbeat service, producing copious messages in /var/log/messages. To verify that everything’s working properly, run the following commands:


# mount | grep /home
# ifconfig | grep 10.10.10.20
# /etc/init.d/nfslock status
# /etc/init.d/nfsserver status
# /etc/init.d/nmb status
# /etc/init.d/smb status
# /etc/init.d/dhcpd status
# /etc/init.d/postfix status

/home should be mounted, the IP address should be set to 10.10.10.20, and all of the services should be running.
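
If you run these checks often, the per-service status calls can be folded into a loop. This is a minimal sketch: the function name check_services is hypothetical, and it assumes each init script exits with status 0 from its status action when the service is running.

```shell
# check_services INITDIR SERVICE...
# Run "INITDIR/SERVICE status" for each service and report the result.
# Succeeds only if every service reports that it is running.
check_services() {
    initdir=$1
    shift
    failed=0
    for svc in "$@"; do
        if "$initdir/$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: NOT running"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

On the sample cluster you would call it as check_services /etc/init.d nfslock nfsserver nmb smb dhcpd postfix on whichever node currently holds the resource group.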

Next, start Heartbeat on the other node. Heartbeat will start up like the first node, except it should not start the services.

Migrating Services Manually

Next, tell Heartbeat to move the services from paul to silas by logging into paul, and issuing the command:


# /usr/lib/heartbeat/hb_standby

Heartbeat quickly moves the entire set of services over to silas. The whole process should take about 15 seconds. Next, log into silas, check the logs, and verify that the services are all running. Next, issue the hb_standby command on silas, to move all of the services back to paul. Check the logs on paul to verify the services are running there again.

Simulating Network Failures

Because there is a ping directive in ha.cf, Heartbeat dutifully pings the router from each machine every second. And because ipfail was started in ha.cf, it monitors the results to see which machine has better connectivity.

At this point, your services should be running on paul. To test ipfail, disconnect paul’s eth0 connection; the resources should migrate to silas. Restoring connectivity to paul and then removing it from silas should cause the services to move back.

Simulating Crashes

On to braver tests, and testing crashes. If you followed all of the previous test procedures, the services should be running on paul. Issue the following command on both machines to cause Heartbeat and DRBD to start automatically at boot time:


# chkconfig heartbeat 35
# chkconfig drbd 35

Next, press the reset button on silas. After it reboots, it will start a quick synch with paul, which should complete in a few seconds. After it completes, press the reset button on paul. The services should migrate over to silas with about a ten second delay.

Keeping On

Creating, configuring and testing high-availability systems is an interesting and complex activity that’s only touched on here. However, as you can see, for the cost of a serial cable, some NICs, the price of a few hard drives, and a little of your time, you can create an effective HA cluster — if only with two machines. Read the documentation that comes with Heartbeat and DRBD, and join the Linux-HA and DRBD mailing lists to learn even more. The Linux-HA home page is http://linux-ha.org.



Alan Robertson works for the IBM Linux Technology Center where he is the chief cook and bottle washer (project leader) for the Linux-HA project. He can be reached at alanr@unix.sh.
