Distributed Filesystems for Linux

Linux has always made a great file server, but now a whole new breed of open source filesystems is taking the file server idea to the next level. We show you what they are and tell you how they work.



Way back in the day, if you wanted to share files between two computers, you copied them onto a floppy disk and walked them over to the other computer. This was known as “sneaker net.” What a difference a couple of decades makes. Today, personal computers running Microsoft Windows and Apple’s MacOS inherently support sharing disks and directories with other systems of the same type. And of course, the ability to share disks, directories, and files over a network has always been a hallmark of Unix and Unix-like systems (such as Linux).

Unfortunately, all of that convenience does not come without a price. Sharing filesystems over networks, and managing updates to files and directories in those filesystems, becomes more complex when many computers and users can access those shared resources. Filesystems that can be shared over networks are more properly known as “distributed filesystems” because shared files and directories are available on (or distributed across) many different computers on a network. Most distributed filesystems are examples of client/server computing, where servers export files and directories and accept updates to them, while clients import files and directories and send updates.

The history of distributed filesystems on commercial Unix and Unix-like systems began with the proprietary Domain network filesystem introduced on Apollo workstations in the early 1980s. It continues through Sun Microsystems’ NFS (Network File System), introduced in the mid-1980s, and culminates in more sophisticated, higher-performance network-based filesystems such as IBM’s AFS, Carnegie Mellon University’s Coda, and Stelias Computing’s InterMezzo. NFS was the first open implementation of a networked filesystem; its specifications and protocols were publicly available, and NFS was therefore quickly supported on many different types of computer systems.


Distributed filesystems provide significant advantages to almost everyone who uses or manages computers. They enable users to access their data files in exactly the same way from different computers. If the machine on your desk fails, just use another; your files are still intact and safe on the centralized file server. Maximizing information sharing and availability on college campuses was the genesis of many of the projects that led to the filesystems described in this article.

For system administrators, storing important data on centralized file servers, rather than on individual desktop systems, simplifies administrative tasks such as backing up and restoring files. It also reduces hardware costs for desktop systems and centralizes standard system administration tasks such as creating/deactivating accounts, monitoring filesystem use, and so on. IS and IT managers who are responsible for enterprise computing services may already be using a distributed filesystem (specifically NFS or Samba), but there are many alternatives with better performance and more features.

Principles of Distributed Filesystems

Distributed filesystems must provide two basic features:

Transparent Access to User Files from Different Computers: This means that users are not aware of whether their files are local or located on a remote server. Storing files on centralized file servers frees users from having to access their files from specific machines. Users of well-designed distributed filesystems should be able to log in on any machine in their computing environment and access and update their files in exactly the same way as they would on any other machine.

Network-Oriented Authentication: If users can log in on any machine and access their files transparently from those machines, replicating authentication (password, group, etc.) information on every machine quickly becomes an administrative nightmare. For this reason, networked filesystems typically go hand-in-hand with a networked authentication mechanism, which supersedes traditional sources of login information such as /etc/passwd and /etc/group on Linux/Unix systems. NIS, NIS+, and Kerberos are common examples of such network-oriented authentication mechanisms.
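On a Linux system, the choice of authentication sources is typically configured in /etc/nsswitch.conf. As a minimal sketch for a site using NIS (the exact lines depend on your distribution and setup), the following entries tell the C library to consult the local files first and then NIS:

 passwd: files nis
 group:  files nis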

Another important feature of modern distributed filesystems is replication: the ability to provide online copies (replicas) of portions of the distributed filesystem. Should a file server that exports a primary portion of the filesystem become unavailable, replicas are instantly and transparently available to users without interrupting their work.

Support for disconnected operation is related to replication. “Disconnected operation” is the term used to describe the ability to access and work on one’s files when not connected to a server. Local copies of files and directories are automatically resynchronized with the copies on the server when connectivity is restored. This is especially important to laptop computer users.

Now that we’ve talked about what distributed filesystems are, let’s take a look at some of the products that are available for Linux today (and are under active development). The Finding the Right Filesystem table on page 66 provides a high-level comparison of them based on common issues in distributed computing.

Implementing Distributed Filesystems — Locks And Callbacks

Figure One: Retrieving a file from a file server.

Figure Two: Using callbacks to notify clients of file updates helps solve consistency problems.

The machines participating in a distributed filesystem fall into two classes: servers, which deliver files and directories and accept updates to them, and clients, which request files and directories and deliver updates to them. Distributed filesystems use a variety of mechanisms to guarantee that any local data on clients is the same as the data on the remote file server. In the distributed filesystem biz, this is known as “consistency.”

The crudest way to do this is to lock a file on a server when a client requests it. This means that only one client can check out a file at any given time. Only slightly less crude is separating these into read and write locks. Any number of clients can request a copy of a file to read it, but only one client can request a copy of the file to be able to write to it. This almost guarantees consistency problems (assuming that the client with the write lock updates the file while other clients are viewing read-only versions of the file). How are the clients with read-only versions of the file notified when the file changes?

Distributed filesystems that support disconnected operation (where clients can edit files while disconnected from a file server, and must therefore work with copies of the files that they already have) have their own set of problems. However, for distributed filesystems with a continuous network connection to a file server, a mechanism called a “callback” provides a great solution to this problem. A callback is essentially just a software promise made by a distributed file server to notify a client whenever changes are made to a file that the client has received from it. If the version of the file on the file server changes, the file server breaks the callback, telling the client that it must re-retrieve the file.

The callback mechanism ensures that clients always have the most up-to-date version of a file. As explained in the Your Cache Ain’t Nothing But Trash sidebar on page 68, making sure that file servers notice when clients have updated files depends on when clients write modified files back to a file server.
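To make the callback idea concrete, here is a minimal sketch, in Perl (the language of InterMezzo’s lento cache manager), of the bookkeeping a file server might perform. Every name and data structure here is hypothetical; none of the filesystems discussed in this article implement callbacks in exactly this way.

 #!/usr/bin/perl -w
 use strict;

 # %callbacks maps each file to the list of clients holding a cached copy.
 my %callbacks;

 # A client fetches a file: the server records a callback promise for it.
 sub fetch_file {
     my ($client, $file) = @_;
     push @{ $callbacks{$file} }, $client;
     # ... send the file contents to $client ...
 }

 # A client writes a file back: the server breaks every other client's callback.
 sub update_file {
     my ($writer, $file) = @_;
     foreach my $client (@{ $callbacks{$file} || [] }) {
         next if $client eq $writer;
         notify_break($client, $file);   # tell the client its copy is stale
     }
     $callbacks{$file} = [ $writer ];    # only the writer's copy is now current
 }

 sub notify_break {
     my ($client, $file) = @_;
     print "break callback: $client must re-fetch $file\n";
 }

 fetch_file("clientA", "/vol/report.txt");
 fetch_file("clientB", "/vol/report.txt");
 update_file("clientA", "/vol/report.txt");   # clientB is notified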


AFS

AFS (which used to stand for the Andrew File System but is now an acronym without an expansion) was originally a 1983 research project sponsored by IBM at Carnegie Mellon University (CMU). AFS grew to support thousands of users at CMU before it was turned into a commercial product in 1989 by Transarc Corporation (which was later acquired by IBM).

Sets of administratively-related AFS clients and servers, known as AFS cells, are located all over the world and can easily be configured to intercommunicate, exchange Kerberos authentication information, and share data over the Internet. AFS file servers provide their own volume management services and formats using system support for logical volume management (LVM) whenever available. On systems using LVM, exported disk partitions can actually consist of several server partitions that can span multiple physical devices but are exported as a single entity. (Logical Volume Management for Linux is actively under development and is a default portion of kernel versions 2.3.49 and later. See this month’s Guru Guidance on page 78.)

AFS clients pioneered persistent client-side caching to increase performance, preserve state information across system restarts, and reduce client startup time. A persistent cache is able to survive across reboots of a client. For more information about caching in distributed filesystems, see the Your Cache Ain’t Nothing But Trash sidebar on page 68.

In early September 2000, IBM announced that it was releasing the AFS source code to the open source community under the IPL (IBM Public License), which is essentially IBM’s version of the GNU General Public License (GPL) with additional protection for IBM’s patents. IBM will still sell and support an official version of AFS, but the open source version will remain open source and free.

An AFS Advisory Board consisting of IBM personnel, academics, and customer representatives will consider new additions to the supported IBM version of AFS, as well as consider migrating portions of code back into the official version from the open source version. One of the biggest oversights in AFS has always been its lack of support for disconnected operations. Hopefully, this omission will soon be properly addressed by one of the new versions of AFS.


Coda

Coda is a distributed filesystem with its origins in AFS version 2; it has been under development at Carnegie Mellon University since 1987. Coda shows its AFS roots in its support for persistent client-side caching (which helps minimize client restart times) and server replication, while providing effective scalability. Unlike AFS, Coda supports disconnected operation for mobile computing. As an academic project, the source code and binaries for Coda are freely available under a liberal license: most of the code is GPL, but portions (the libraries) are Lesser GPL (LGPL).


GFS

The Global Filesystem (GFS) is a shared-disk filesystem for clusters of Linux systems. The nodes in a GFS cluster share the same storage devices, accessing those devices at the block level rather than at the file and directory level (as with most of the other filesystems discussed in this article). For this reason, GFS requires low-level disk drivers rather than file-level client and server daemons. The binaries for several disk drivers are available at the GFS download URL (see Web Links, pg. 70). A GFS filesystem appears to be local to each node, with file access across the cluster synchronized by GFS.

GFS is an open source, 64-bit SAN (Storage Area Network) filesystem with an impressive list of industry sponsors (including Veritas Software, StorageTek, NASA Ames Research Center’s Mass Storage Systems Group, Seagate Technologies, and EMC). SAN solutions are a hot topic in the centralized storage and distributed filesystem world because their block-level disk access makes them inherently independent of processor type and operating system. All actual disk access is done by the low-level drivers, which simply fetch and store data and therefore make no assumptions about data types and organization.


InterMezzo

InterMezzo is a new distributed filesystem with a focus on high availability, flexible replication of directories, disconnected operation, and a persistent cache. InterMezzo is an open source project, currently available for Linux kernel versions 2.2 and 2.3.

InterMezzo was inspired by CMU’s Coda, but is not based on the Coda source code. One of the founders of Stelias Computing (the company that oversees the active development of InterMezzo) was the head of the Coda project at CMU for several years.

On both clients and servers, InterMezzo journals filesystem activity, recording local filesystem changes in a log kept within an ext2fs filesystem. InterMezzo provides a set of wrapper functions that enable it to interoperate with other journaled filesystems (such as SGI’s XFS or IBM’s JFS).

The TurboLinux, Red Hat, Mandrake, and SuSE Linux distributions have all expressed interest in InterMezzo, but are not, at this point, active users of this new distributed filesystem.


NFS

NFS is a distributed filesystem that provides transparent access to files residing on remote disks. Developed at Sun Microsystems in the mid-1980s, the NFS protocol has since been revised and enhanced a number of times. It is available on all Unix systems and even for Microsoft Windows. The specifications for NFS have been publicly available since its inception, which is what allowed NFS to become the de facto standard distributed filesystem.

Despite its ubiquity, NFS has some problems. First, it is stateless: the server retains no information about its clients between requests, and NFS clients do not maintain a cache that provides useful information across client reboots. Second, NFS performance is average at best. Recent versions of Linux (kernel 2.2 or later) provide knfsd, an NFS daemon that runs largely within the Linux kernel and delivers better performance than the standard user-space NFS daemons. Finally, NFS has primitive locking mechanisms that are not very useful if many client applications are simultaneously accessing the same data files.

NFS is unquestionably the most popular distributed filesystem around, because it works with almost every operating system, is easy to set up and administer, and provides a better solution than the alternative of not sharing files.
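As an illustration of that simplicity, exporting a directory under Linux takes a single line in /etc/exports on the server, plus a mount command on the client. The hostnames and paths below are examples only, and the exportfs command assumes the knfsd tool set; older user-space servers re-read /etc/exports when their daemons are restarted:

 # on the server, in /etc/exports: export /home read-write to one client
 /home client.example.com(rw)

 # re-export on the server, then mount on the client:
 /usr/sbin/exportfs -a
 mount -t nfs server.example.com:/home /mnt/home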

Your Cache Ain’t Nothing But Trash

Figure Three: Using callbacks to notify clients of cached file updates improves performance.

Actually, we hope not. A cache is the portion of the client’s local disk that contains copies of the files and directories that the client has requested from file servers. Whenever an application running on the client requests a file that is actually located on a remote file server, the request first passes through the cache. If the file is not already located in the cache, or if it is there but the callback to the file server has been broken (see the discussion of callbacks in the main article), the client requests the file from the server again. A new version of the file is stored in the cache, and a pointer to it is returned to the application that requested it.
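In the same hypothetical Perl style as the callback sketch in the main article, the client-side logic just described boils down to something like this:

 #!/usr/bin/perl -w
 use strict;

 # %cache maps file names to copies held on the client's local disk.
 my %cache;

 sub open_file {
     my ($file) = @_;
     my $entry = $cache{$file};
     # Re-fetch if the file was never cached or its callback has been broken.
     if (!defined($entry) || !$entry->{callback_ok}) {
         $entry = { data => fetch_from_server($file), callback_ok => 1 };
         $cache{$file} = $entry;
     }
     return \$entry->{data};    # the application works on the cached copy
 }

 sub fetch_from_server {
     my ($file) = @_;
     print "fetching $file from the server\n";
     return "contents of $file";
 }

 open_file("/vol/report.txt");   # first open: retrieved from the server
 open_file("/vol/report.txt");   # second open: satisfied from the cache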

Caching improves the speed and efficiency of accessing files that are actually located on a distributed file server. If multiple applications or users repeatedly ask for data from the same file, that file is already in the cache on the local disk. This reduces both the load on the network and the time it takes to satisfy users’ requests for that file. Whenever a user modifies the cached file or directory, the callback between the client and the file server is broken, and the file server retrieves the updated version from the client that made the changes. The server then breaks the callback between itself and any other clients that have older cached copies of the file. (Warning: this is an oversimplification meant to introduce you to the notion of how caching can improve performance and is not designed to replace an upper-level course in computer science!) Figure Three shows the same thing as Figure Two (pg. 64), but with a cache added appropriately.

Another way in which caching data from file servers can improve performance is by preserving and re-using the cache when you restart your system. The theory here is that users of a client will probably resume working on the same files that they were previously working on. If those files are still in the cache and haven’t changed on the file server (i.e., the callbacks can be re-established), then the client can use the contents of the cache without retrieving the files from the file server. This is known as a “persistent” cache, not because it’s irritating or obnoxious, but because its contents persist and are often reusable across system reboots. In the absolute worst case, your cache ain’t nothing but trash, and every file in the cache must be discarded. In this case, the client doesn’t retrieve anything, but simply discards the current contents of the cache and proceeds normally, retrieving and caching any files that users of the client request.

NFS uses a cache, but its cache is designed to allow client systems to continue working while files are being saved back to the file server; an NFS cache does not persist across reboots. Distributed filesystems such as AFS, Coda, and InterMezzo make heavy use of caching to improve performance when they can communicate directly with a file server. When Coda or InterMezzo is running disconnected from a file server, it depends entirely on the data in the cache; once it can reconnect to the file server, it resynchronizes the cache with the data on the server. This topic is extremely complex. See Web Links (pg. 70) for more information.

Summary And Conclusions

Distributed filesystems are worth their weight in gold to both users and system administrators. NFS is still the most popular distributed filesystem available today, because it is free and is delivered with most types of systems. However, the variety of distributed filesystems available for Linux under the GPL or similar open source licenses is about to change all of that. Being able to install and run AFS, Coda, or InterMezzo on your Linux system can provide substantial performance and usability improvements over NFS. Thanks to Linux and the open source movement, you no longer have to accept the limitations of NFS, using it simply because it is there.

Managing more advanced distributed filesystems than NFS requires some administrative overhead. For example, installing and using AFS requires learning AFS-specific commands and backup procedures. All distributed filesystems, even NFS, require some administrative configuration on both clients and servers. All of the advanced distributed filesystems discussed in this article require installing kernel patches or compiling and installing loadable kernel modules, which is not necessarily for the faint-of-heart or the novice administrator. As binary RPMs for these filesystems become available, installing and using modern distributed filesystems will become even easier.

Finding the Right Filesystem

Filesystem | Supported Platforms | Security System | Scalability | Client Cache Location | Persistent Cache | Availability | Licensing
AFS | All Unix platforms, Red Hat Linux, Windows 95/98/NT/2000 | Kerberos | Excellent | Local disk directory | Yes | Current | IBM Public License
Coda | Linux, NetBSD, FreeBSD, Windows 95, Windows NT | Kerberos | Yes | Local disk directories or partitions | Yes | Current | GPL, Lesser GPL
InterMezzo | Linux kernels 2.2 or better | None | Yes | Local disk directories or partitions | Yes | Current | GPL
NFS | All Unix platforms and Linux distributions, Windows 95/98/NT/2000 | yp/NIS, NIS+ | Acceptable | Local disk directory | No | Current | Linux versions GPL; others proprietary with open specs

Web Links

NFS for Linux (NFS and KNFS)

Installing And Configuring an InterMezzo Server And Client

Using Stelias Computing’s InterMezzo distributed filesystem as an example, this sidebar explains how to actually get a modern distributed filesystem up and running. We selected InterMezzo for a variety of reasons: it is available under the GPL, its lento cache manager is a Perl program, it is simpler to configure than most distributed filesystems, and it is under more active development than some of the other distributed filesystems discussed in this article. This HOWTO sidebar uses Stelias’ 0.02 software release as an example, but the 0.9 release of InterMezzo is approaching release-candidate status.


You’ll need at least two computers running Linux, one to act as an InterMezzo file server and at least one to act as a client of that server. When using the 0.02 release of InterMezzo, these should all be uni-processor systems; even multi-processor systems running in single-processor mode may use untested code paths through the kernel (as we learned).

You must have the Linux kernel sources installed in /usr/src/linux (which can be a symlink, as it normally is) on each machine where you will install InterMezzo. The kernel configuration file (.config) in that directory on each machine must be correctly configured to match the kernel that is running on that system. The kernel shipped with Red Hat 6.2 was release 2.2.14-5.0, which we recommend using if you want to guarantee that things work for you exactly as described in this HOWTO.

For optimal experimentation with InterMezzo, you’ll need a spare partition on each client and server. The examples later in this HOWTO explain how to experiment with InterMezzo even if you don’t have a spare partition available, but it would be better if you did.

If you don’t have a spare partition on your server and client(s) that you can dedicate to InterMezzo, the kernel running on each of your systems must be compiled with support for the “loopback device.” This is selectable using the Block Devices -> Loopback Device Support option from make menuconfig or make xconfig. I prefer to build this into the kernel rather than making it a loadable module. If you need to rebuild your kernel to support this device, you’ll need to run make dep and then rebuild and install the new kernel, as shown below.
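For a 2.2-series kernel, the full rebuild sequence looks something like the following (run from /usr/src/linux; the final make install assumes your system is set up to install kernels that way, so adjust for your own bootloader arrangements if necessary):

 cd /usr/src/linux
 make menuconfig        # enable Block Devices -> Loopback Device Support
 make dep
 make bzImage
 make modules
 make modules_install
 make install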

STEP 1: Get the Source, Luke! Retrieve the Following Packages:

The first three are fixes and/or enhancements to Perl itself; the fourth is the source code for InterMezzo 0.02.

STEP 2: Build and Install InterMezzo and Perl Enhancements

Make a directory somewhere; make that directory your working directory and extract the contents of each of these using a command like:

 gzip -cd filename | tar xvf -

Use the su command to become the root user on your system. Next, cd to each of the directories containing the Perl enhancements. Make and install the software using the following commands:

perl Makefile.PL
make install

Finally, cd to the intermezzo directory (created for you when you unpacked the file intermezzo-0.02-0.tgz). Make and install the software using the commands:

make all
make install

STEP 3: Configure Your System for InterMezzo

Modify your system’s kernel module configuration using the following commands:

 echo "alias char-major-185 presto" >> /etc/conf.modules
 echo "alias InterMezzo presto" >> /etc/conf.modules

Next, configure your system’s module dependencies correctly:

 /sbin/depmod -a

STEP 4: Reboot

If you have rebuilt your kernel to add loopback support, you’ll need to reboot. You’ll also want to verify that you haven’t accidentally broken anything (such as any other aspect of network support) while building the new kernel. In general, rebooting after reorganizing module dependencies and configuration is a good idea.

STEP 5: Create the InterMezzo Configuration Files on Each System

After rebooting, log in as root. Create the directory /etc/intermezzo. You will create three files in this directory on your server and on each client.
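For example, as root:

 mkdir /etc/intermezzo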


The sysid file

This file contains a single line that lists some information: the name of the system, the name of the device used for InterMezzo, and the system’s IP address. For example, on my server, distfs, this line looks like:

 distfs /dev/intermezzo0

You’ll get the name of the InterMezzo device from the /etc/fstab entry that you’ll create later in this section.


The serverdb file

This file is a database of key/value pairs (“hashes” in Perl-speak) that identify the InterMezzo file servers in your environment. In my case, this is:

 distfs => {
     ipaddr => ""    # the server's IP address
 }

The voldb file

Like the serverdb file, this file is a database of key/value pairs. The pairs identify the servers and clients that replicate files and folders from the InterMezzo device on the server (and are therefore known as “replicators”). In a setup with one client (test) and one server (distfs), where the name of the shared volume is shared, this looks like the code below. If you’re adding multiple clients, the name of each client must be enclosed within quotation marks and separated from the next by a comma.

 shared => {
     servername => "distfs",
     replicators => [ "test" ]
 }





STEP 6: Create and Mount the Necessary Partitions or Loopback Devices

Next, on each system, create the devices that will cache the shared files and directories. If you have a free partition on each client and server dedicated to InterMezzo, format it as an ext2fs partition using a command like:

 /sbin/mke2fs /dev/hdb2

/dev/hdb2 is only an example! Before executing this command, make sure that the device name is correct for the free partition you want to use on your system and that it really is the one you want to create a filesystem on! There is absolutely no way to recover the previous contents of a partition once mke2fs has been run on it.
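Before formatting, it’s worth double-checking that the partition really is spare; for example (again assuming /dev/hdb2):

 /sbin/fdisk -l /dev/hdb    # list the disk's partition table
 grep hdb2 /etc/fstab       # make sure nothing mounts it at boot time
 grep hdb2 /proc/mounts     # make sure it isn't mounted right now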

If you do not have a dedicated partition to use for testing InterMezzo, you can create a small cache partition in /tmp using the loopback device through the following commands:

dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
/sbin/losetup /dev/loop0 /tmp/fs0
/sbin/mke2fs /dev/loop0
/sbin/losetup -d /dev/loop0

Next, create a directory to use as a mount point for this partition at /izo0, which is identified in the /etc/fstab entry that we’ll create next.
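On each system:

 mkdir /izo0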

Create an entry in /etc/fstab for the InterMezzo partition. For a loopback device, this would be:

/tmp/fs0 /izo0 Intermezzo \
loop,volume=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,noauto 0 0

Note that the long block of mount options must be comma-separated and cannot contain white space; otherwise, it will be interpreted as multiple fields.

To use a physical partition for the InterMezzo file cache, use an /etc/fstab entry like the following:

/dev/hdb2 /izo0 Intermezzo \
volume=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,noauto 0 0

The name of the volume must match the name of the volume listed in your /etc/intermezzo/voldb file. The name of the InterMezzo device (prestodev) must match the name of that device listed in your /etc/intermezzo/sysid file. The system name and IP address must match systems listed in your /etc/intermezzo/serverdb file, and one must be the name of the system in your /etc/intermezzo/sysid file.

Finally, mount the InterMezzo cache partition using the command:

mount /izo0

STEP 7: Verify Your InterMezzo Configuration

InterMezzo includes a script called config_check, located in the tools subdirectory of the directory where you installed and built InterMezzo 0.02. You should run this tool to verify that your configuration files are correct before starting InterMezzo. Since we’ve used the default locations for files throughout this HOWTO section, you don’t need to supply any arguments to the script. Check its output carefully to make sure that all of the InterMezzo configuration files are correct and consistent with each other.
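For example, if you unpacked the source into a directory named intermezzo (as in Step 2), the check looks like this:

 cd intermezzo/tools
 ./config_check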

STEP 8: Start the lento Cache Manager on Each System

To start InterMezzo, simply start its cache manager, lento.pl. To do so, cd to the lento subdirectory in the directory where you installed and built the InterMezzo 0.02 source code. Start the lento cache manager by executing the command:

./lento.pl >& lento.out &

At this point, you will be able to execute commands like ls /izo0 to see that it is a real filesystem that currently only contains the standard lost+found directory.

STEP 9: Experiment!

Now that your systems are running InterMezzo, you can copy files or folders to /izo0 on the client or the server, and they will be quickly replicated to the other system. Since InterMezzo 0.02 is a very early release, you should only modify replicated files or directories on clients; those changes will be migrated back to the server. If you modify replicated files and directories on both the client and the server, InterMezzo cannot synchronize them, because it does not yet support “merge” features à la CVS for text files. You can test InterMezzo’s support for disconnected operations by disconnecting the network cable from your clients, modifying files in /izo0 on those clients, then reconnecting your system to the network and watching the changes on the client propagate back to the server.
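A quick way to watch replication in action (the file name here is just an example):

 # on the client:
 cp ~/notes.txt /izo0
 # on the server, a few moments later:
 ls -l /izo0/notes.txt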

InterMezzo 0.02 has some limitations, primarily the absence of a distributed authentication mechanism, which means that you need to be logged in as root in order to modify files in the /izo0 cache. It is an experimental filesystem, so you should be careful to frequently back up files in /izo0 to a standard, non-replicated partition. You should always back them up before shutting down a disconnected system, as you may need to repopulate the cache each time you restart the lento cache manager. InterMezzo’s support for disconnected operations is continually being refined and enhanced.
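A simple way to take such a backup (the destination path is just an example):

 tar czf /root/izo0-backup.tar.gz -C /izo0 .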

Bill Von Hagen is president of Move2Linux.com. He can be reached at wvh@movetolinux.com.
