Data can be the currency, intellectual property, and lifeblood of many a company. One technique for making sure that your data is readily available is data replication. It is not quite the same as data backup, but it can be equally important.
One of the best tools in a system administrator’s arsenal is rsync. It was developed by Andrew Tridgell of Samba fame. Rsync synchronizes files and directories between storage pools while transferring as little data as possible between them. It uses what is called delta encoding to send only the parts of a file that have changed, and it can also use compression and recursion to make the synchronization very efficient. The key point is that rsync synchronizes file data between storage pools. This means that the underlying storage doesn’t have to be the same between the different storage pools.
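To make the idea of delta encoding a little more concrete, here is a minimal Python sketch. It is not rsync’s actual algorithm: rsync uses a rolling checksum so it can match blocks at any byte offset, while this simplified version only compares blocks at fixed offsets. The function names and the tiny block size are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


def block_signatures(old: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map the hash of each block in the receiver's copy to its block index."""
    signatures = {}
    for offset in range(0, len(old), block_size):
        digest = hashlib.sha256(old[offset:offset + block_size]).hexdigest()
        # Keep the first index seen if two blocks have identical content.
        signatures.setdefault(digest, offset // block_size)
    return signatures


def make_delta(new: bytes, signatures: dict, block_size: int = BLOCK_SIZE) -> list:
    """Describe `new` as references to blocks the receiver has, plus literal bytes."""
    delta = []
    for offset in range(0, len(new), block_size):
        chunk = new[offset:offset + block_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in signatures:
            delta.append(("copy", signatures[digest]))  # receiver already has it
        else:
            delta.append(("data", chunk))  # only this part crosses the network
    return delta


def apply_delta(old: bytes, delta: list, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file on the receiver from its old copy and the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            start = value * block_size
            out += old[start:start + block_size]
        else:
            out += value
    return bytes(out)


old_copy = b"AAAABBBBCCCC"
new_copy = b"AAAAXXXXCCCC"
delta = make_delta(new_copy, block_signatures(old_copy))
rebuilt = apply_delta(old_copy, delta)
```

In this example only the four changed bytes travel as literal data; the other two blocks are sent as short references to data the receiver already holds.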
Configuring rsync is very easy, and there are lots of tutorials and articles on the web that show how to integrate it with various authentication mechanisms and transport protocols such as SSH, PAM, and Kerberos; it can even accommodate encryption. The first step in using rsync is to create two storage pools that are separated by some geographic distance (that’s the whole point of replication even if the distance is just a few feet). Ideally, the two pools should be identical, but they don’t have to be in the case of rsync since replication happens at the file level (one advantage of rsync). All you really need is at least as much capacity in the secondary storage pool as in the primary pool.
The next step is to choose the primary storage, which will run the rsync server (daemon). The secondary storage runs an rsync client. Based on the exact rsync command, the rsync server works out what data is needed and sends it to the rsync client. When and how often the synchronization runs is up to you. You can put the rsync command in a cron file and choose the time interval between rsync operations. If no data has changed on the server since the last rsync, then no file data is sent (this could happen in the evening when most people, except for us technical types, are home watching The Big Bang Theory).
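As a rough illustration of what such a scheduled run might look like, the Python sketch below assembles an rsync command line in which the client on the secondary pulls from the daemon on the primary, and then formats a crontab entry for it. The host name, module name, destination path, and 15-minute interval are all assumptions for the example; the `--archive`, `--compress`, and `--delete` options and the `rsync://` URL form are standard rsync features.

```python
import shlex


def build_rsync_command(source: str, destination: str,
                        compress: bool = True, delete: bool = False) -> list:
    """Assemble an rsync argument list; nothing is executed here."""
    args = ["rsync", "--archive"]  # recurse and preserve permissions, times, etc.
    if compress:
        args.append("--compress")  # compress file data in transit
    if delete:
        args.append("--delete")    # remove files that no longer exist on the source
    args += [source, destination]
    return args


# Hypothetical daemon module "data" on the primary, pulled into a local directory.
command = build_rsync_command("rsync://primary.example.com/data/", "/srv/replica/")

# A crontab entry that would run the command every 15 minutes.
cron_line = "*/15 * * * * " + shlex.join(command)
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` later without worrying about shell quoting.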
The third step is to configure the rsyncd.conf file on the rsync server and schedule the rsync runs (e.g., as a cron job). There are plenty of tutorials that describe this in detail, including how to use rsync with various security tools and encryption, so I won’t go through it step by step in this article, but it’s actually very easy to do.
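Just to give a feel for how small the configuration can be, here is a hedged Python sketch that renders one module section for rsyncd.conf. The parameters shown (`path`, `comment`, `read only`, `hosts allow`) are real rsyncd.conf parameters, but the module name, path, and address are invented for the example, and a real deployment would likely also want authentication settings.

```python
def render_rsyncd_module(name: str, path: str, comment: str, allowed_hosts: str) -> str:
    """Return the text of a single, read-only rsyncd.conf module section."""
    lines = [
        f"[{name}]",
        f"    path = {path}",
        f"    comment = {comment}",
        "    read only = yes",  # clients may pull from this module but not push
        f"    hosts allow = {allowed_hosts}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: a module called "data" that only the secondary may read.
config_text = render_rsyncd_module(
    name="data",
    path="/srv/primary",
    comment="Primary storage pool",
    allowed_hosts="192.0.2.10",
)
```

Restricting the module to read-only access from a single host is one conservative starting point for a replication source; it is a choice made for this sketch, not a requirement of rsync.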
DRBD (Distributed Replicated Block Device) is a mechanism for what is basically RAID-1 across a network. It takes the blocks from one storage pool and mirrors them on a different storage pool across a TCP-based network. One of the key aspects of DRBD is that it is block based. Figure 1 below, taken from the DRBD website, gives a fairly detailed overview of how it works.
Figure 1: Overview of DRBD (from www.drbd.org)
The left-hand side of the diagram is the primary storage pool. You can follow the data flow by tracking the orange arrows. As the data is written from the service at the top toward the actual disk at the bottom left, DRBD copies the data to the TCP/IP stack, where it is sent to the secondary storage pool on the right-hand side. There it is picked up by DRBD on the secondary storage pool and sent to the actual storage devices. After the data is copied to the network by DRBD on the left-hand side, it continues normally to the I/O scheduler and ultimately the disk.
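The write path just described can be mimicked with a toy Python model. This is only a sketch of the block-mirroring idea and has nothing to do with DRBD’s real implementation: the class name is hypothetical, the "network" is a plain method call, and there is no handling of failures, ordering, or acknowledgements.

```python
class MirroredDevice:
    """Toy block device that mirrors every write to a second copy."""

    def __init__(self, num_blocks: int, block_size: int = 512):
        self.block_size = block_size
        self.local = [bytes(block_size)] * num_blocks  # disk on the primary
        self.peer = [bytes(block_size)] * num_blocks   # disk on the secondary

    def _send_to_peer(self, index: int, data: bytes) -> None:
        # Stands in for handing the block to the TCP/IP stack.
        self.peer[index] = data

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("writes must be exactly one block long")
        self._send_to_peer(index, data)  # copy the block toward the secondary
        self.local[index] = data         # then continue to the local disk


device = MirroredDevice(num_blocks=4, block_size=4)
device.write_block(2, b"DATA")
```

Notice that the model knows nothing about files or directories. It only sees numbered blocks, which is the essential difference from a file-based tool.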
If you haven’t noticed in Figure 1, all of these operations happen in kernel space (inside the box). For a long time DRBD existed as a set of patches outside the kernel. However, as of Linux 2.6.33, DRBD is included in the mainline kernel.
In operation, DRBD layers block devices over existing block devices. For example, it layers a DRBD block device such as /dev/drbd1 over a physical or logical device such as /dev/sdb1. Then you use the DRBD block device for the file system. It is recommended that the underlying block device (e.g. /dev/sdb1) be a logical volume that is built using LVM. This allows you to grow the storage on the primary to meet capacity needs, but remember that you will have to also increase the capacity on the secondary storage pool to match the primary.
There are a number of tutorials on the web showing how to set up DRBD between two storage pools. Despite the fact that replication happens in the kernel, it’s actually fairly easy to configure and use DRBD.
This article has just been a quick overview of replication in Linux. Replication is the mechanism for making a copy of data from a primary storage pool to a geographically distant secondary storage pool. If you like, you can think of it as mirroring data over a network. The goal is to have a secondary storage pool that is an exact copy of the current set of data and can be used in the event that the primary storage pool becomes unavailable. As a result, replication is often used for disaster recovery.
A key point is that replication is fundamentally different from backups. A backup is designed to keep past versions of data that are available for restoration if needed. A backup also may not have the latest copy of the data, whereas replication is designed to have a copy of the data that is as close as possible to the original. Theoretically you could use a backup to restore a complete storage pool, but it would take a great deal of time and would be missing any changes made between the last backup and the moment the storage pool went down.
I briefly mentioned two replication options in Linux: (1) rsync, and (2) DRBD. Both are fairly easy to configure but they differ in one fundamental way – rsync is file based and DRBD is block based. Both accomplish the same goal of replication but DRBD, since it is in the kernel, has a smaller data “gap” than rsync. This means that the difference between the data in the primary storage pool and the secondary storage pool is smaller for DRBD based replication than rsync replication. How much smaller depends upon how rsync is configured, how much data is being replicated, and the network characteristics between the storage pools.
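For the rsync case, a back-of-the-envelope estimate of this gap can be written down in a few lines of Python. The model is an assumption, not a measurement: it says that in the worst case a change waits one full scheduling interval and then the time needed to transfer the changed data, and it ignores rsync’s file scanning time, compression, and network latency. All the numbers are invented for the example.

```python
def worst_case_rsync_gap_seconds(interval_s: float, changed_bytes: float,
                                 bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on how stale the secondary can be under this simple model."""
    transfer_s = changed_bytes / bandwidth_bytes_per_s
    return interval_s + transfer_s


# Hypothetical setup: rsync every 15 minutes, 500 MB changed per run,
# and a 100 Mbit/s link (12.5 MB/s).
gap = worst_case_rsync_gap_seconds(
    interval_s=15 * 60,
    changed_bytes=500_000_000,
    bandwidth_bytes_per_s=12_500_000,
)
# gap is 940.0 seconds: 900 s of waiting plus 40 s of transfer.
```

Under this model, shortening the interval shrinks the gap only until the transfer time starts to dominate, which is why the amount of changed data and the network between the pools matter as much as the schedule.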
Replication is one of the mechanisms used in storage management when data must always be available. That is, you have a copy of the data readily available so that if the first copy is lost you can still function. Replication is used a great deal in enterprise storage and can even be used at home. If the data on your desktop or home server is important to you and you need to make sure that the current state of the data is available, then replication is something you can easily configure. Given the price of home storage and networks, it is fairly easy to set up a secondary storage pool for your home server. As a system administrator told me when I first became an admin, “It’s best to wear a belt and a pair of suspenders.”
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).