Cool User File Systems: GlusterFS

One of the coolest file systems in user space has got to be GlusterFS. It has a unique architecture that allows it to be configured for specific storage requirements and scenarios. It can be used as a high-performance parallel file system, a cloud-based file system, or even a simple NFS server. All of this in user space. Could GlusterFS represent the future of file system development for Linux?

Introduction

I’ve been talking about user-space file systems for several articles now. The concept of being able to quickly create a file system in almost any language you want, using the FUSE (File System in Userspace) libraries and kernel module, is a very powerful one (although I’m still waiting on the Fortran bindings). One can build a file system that meets a particular set of requirements without having to develop and maintain kernel patches for a long period of time, without having to ask testers to apply and test those patches, and without going through the kernel gauntlet. A FUSE file system can be developed quickly in a variety of languages, gets immediate feedback from testers, is not tied to a particular kernel release, and does not require a kernel patch or rebuild.

GlusterFS is a very sophisticated GPLv3 file system that uses FUSE. It allows you to aggregate disparate storage devices, which GlusterFS refers to as “storage bricks”, into a single storage pool or namespace. It is what is sometimes referred to as a “meta-file-system” which is a file system built on top of another file system. Other examples of meta-file-systems include Lustre and PanFS (Panasas’ file system). The storage in each brick is formatted using a local file system such as ext3 or ext4, and then GlusterFS uses those file systems for storing data (files and directories).

Arguably one of the coolest features of GlusterFS is the concept of “translators” that provide specific functionality such as IO schedulers, clustering, striping, replication, different network protocols, etc. They can be “stacked” or linked to create a file system that meets your specific needs. Using translators, GlusterFS can be used to create simple NFS storage, scalable cloud storage with replication, or even High-Performance Computing (HPC) storage. You just create a simple text file that tells GlusterFS which translators you want for the server and the client, along with some options, and then you start up GlusterFS. It’s that simple.

But getting the translators in the proper order with the proper functionality you want can sometimes be a long process. There is a company, www.gluster.com, that develops GlusterFS and also provides for-fee support for GlusterFS. They have also taken GlusterFS and combined it with a simple Linux distribution to create a software storage appliance. You just pop it into a server that either has some attached storage or SAN storage, and you can quickly build a storage cluster.

GlusterFS is arguably the most sophisticated FUSE-based file system, with a great deal of capability. Let’s take a look at it to understand the capabilities and how it can be used.

GlusterFS

The current version of GlusterFS is 3.0.5 (July 8, 2010) but it has evolved over time. To properly discuss GlusterFS I think we need to jump back to the versions prior to 2.0.8. In these versions creating the file system “translator stack” was a more manual process that took some time and experimentation to develop.

GlusterFS Before 2.0.8

For versions of GlusterFS prior to 2.0.8 (basically version 2.0.7 on down), configuring GlusterFS required more manual work but also offered the opportunity for highly tuned configurations. GlusterFS used the concept of a stackable file system where you could “stack” capabilities in some fashion to achieve the behavior you want (that may sound vague but keep reading because it gets more specific). In particular, GlusterFS uses translators, each of which provides a specific capability such as replication, striping, and so on. Connecting or stacking translators allows you to combine capabilities to achieve the design you want. Let’s examine version 2.0.7 and how one can build a GlusterFS file system.

GlusterFS begins with the concept of a storage brick. It is really a server that is attached to a network and has some sort of storage, either directly attached (DAS) or accessed via a SAN (Storage Area Network). On top of this storage you create a local file system using ext3, ext4, or another local Linux file system (ext3 is the most commonly used file system for GlusterFS). GlusterFS is a “meta-file-system” that collects these disparate file systems and uses them as the underlying storage. If you like analogies, these local file systems are the blocks and inodes of GlusterFS (Note: there are other meta-file-systems such as Lustre).

GlusterFS allows you to aggregate these bricks into a cohesive name space using the stacked translators. How you stack the translators, what network protocols you use, how you select the storage bricks, and how you create the local file systems all contribute to the capacity, performance, and manageability of GlusterFS. If you haven’t read between the lines, one can easily say that the “devil is in the details”, so let’s start with system requirements and configuration details for GlusterFS.

Recall that GlusterFS is in user-space so at the very least you’ll need a kernel on both the storage servers and the clients that is “FUSE-ready”. You also need to have libfuse installed, version 2.6.5 or newer. The GlusterFS User Guide suggests that you use Gluster’s patched FUSE implementation to improve performance.

If you want to use InfiniBand then you’ll need to have OFED or an equivalent stack installed on the servers and the clients. If you want to improve web serving performance, you’ll need mod_glusterfs installed for Apache. Also, if you want better small file performance, you can install Berkeley DB (GlusterFS can use a distributed Berkeley DB backend). Then finally you’ll need to download GlusterFS – either in binary form for a particular distribution or in source form. This web page gives you details on the various installation options.

Assuming that we have all the software pieces installed on at least the servers the next step is to configure the servers. The configuration of GlusterFS is usually contained in /etc/glusterfs. In this directory you will create a file that is called a volume specification. There are two volume specification files you need to create – one for the server and one for the client. It’s a good idea to have both files on the server.

The volume specification in general is pretty simple. Here is an example from the Installation Guide:

volume colon-o
 type storage/posix
 option directory /export
end-volume

volume server
 type protocol/server
 subvolumes colon-o
 option transport-type tcp
 option auth.addr.colon-o.allow *
end-volume


The first section of the volume specification describes a volume called “colon-o”. It uses the POSIX translator so that it is POSIX compliant. It also exports a directory, /export.

The second part of the volume specification describes the server portion of GlusterFS. In this case it says that this volume specification is for a server (type protocol/server). Then it defines that this server has a subvolume called “colon-o”. The third line, after defining the volume, states that the server will be using tcp. And finally the line “option auth.addr.colon-o.allow *” allows any client to access colon-o.
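
To make the block structure of a volume specification concrete, here is a minimal Python sketch that splits the server file above into its volumes. It is only an illustration of the layout described in the text, not GlusterFS’s own parser: the function name parse_volfile is hypothetical, and the sketch handles only the volume, type, option, subvolumes, and end-volume lines that appear in the example.

```python
def parse_volfile(text):
    """Split a volume specification into a dict keyed by volume name.

    Illustrative only: handles just the keywords used in the sample file.
    """
    volumes = {}
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue  # skip blank lines
        keyword = words[0]
        if keyword == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif keyword == "end-volume":
            current = None
        elif current is None:
            raise ValueError("line outside a volume block: " + raw)
        elif keyword == "type":
            current["type"] = words[1]
        elif keyword == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif keyword == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


SERVER_SPEC = """
volume colon-o
 type storage/posix
 option directory /export
end-volume

volume server
 type protocol/server
 subvolumes colon-o
 option transport-type tcp
 option auth.addr.colon-o.allow *
end-volume
"""

spec = parse_volfile(SERVER_SPEC)
print(spec["server"]["subvolumes"])   # ['colon-o']
print(spec["colon-o"]["options"])     # {'directory': '/export'}
```

Reading the result from the top volume downward (server, then colon-o) follows the same path a request takes through the stack.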

After creating the server volume specification file, the next step is to define the client volume specification. The Installation Guide has a sample file that is reproduced here.

volume client
 type protocol/client
 option transport-type tcp
 option remote-host server-ip-address
 option remote-subvolume colon-o
end-volume


The file defines a volume called “client” and states that it is a client (“type protocol/client”). It uses tcp (next line down), and then defines the IP address of the server (just replace “server-ip-address” with the address of the particular server). Then finally it states that it will use a remote subvolume named colon-o.
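
Since the client file changes from site to site only in the server address and the remote subvolume name, it is easy to generate. The following Python sketch produces the same text as the sample above; the function name client_volfile and the address 192.168.1.10 are hypothetical placeholders, and the sketch assumes nothing beyond the five lines shown in the sample.

```python
def client_volfile(server_address, remote_subvolume, transport="tcp"):
    """Return the text of a client volume specification like the sample."""
    lines = [
        "volume client",
        " type protocol/client",
        f" option transport-type {transport}",
        f" option remote-host {server_address}",
        f" option remote-subvolume {remote_subvolume}",
        "end-volume",
    ]
    return "\n".join(lines) + "\n"


# 192.168.1.10 is a made-up address; use the address of your own server.
print(client_volfile("192.168.1.10", "colon-o"), end="")
```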

Once these files are created, then you just start GlusterFS on the server and you start GlusterFS on the clients. The commands are pretty simple – on the server the command is,

# glusterfsd -f /etc/glusterfs/glusterfsd.vol


where /etc/glusterfs/glusterfsd.vol is the volume specification created for the server.

On the client, the command is fairly similar,

# glusterfs -f /etc/glusterfs/glusterfs.vol /mnt/glusterfs


where /etc/glusterfs/glusterfs.vol is the name of the client volume specification file (be sure it is on every client or it is in a common name space shared by all clients – perhaps a simple NFS mounted directory). The second argument to “glusterfs” is the mount point, /mnt/glusterfs. Be sure this mount point exists before trying to mount the file system.

For configuring and starting GlusterFS on a cluster you can use a parallel shell tool such as pdsh to create the mount point on all of the clients and then run the “glusterfs” command to mount the file system on all of the clients.
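
As a rough sketch of that pdsh workflow, the Python snippet below builds the two command lines without running them. The host range client[01-16] is a hypothetical example, and the helper name pdsh_command is made up for illustration; the only pdsh feature assumed is the -w option for naming the target hosts.

```python
import shlex


def pdsh_command(hosts, remote_command):
    """Build the argument list for running one command on many hosts."""
    return ["pdsh", "-w", hosts, remote_command]


CLIENTS = "client[01-16]"                 # hypothetical host range
MOUNT_POINT = "/mnt/glusterfs"
VOLFILE = "/etc/glusterfs/glusterfs.vol"

steps = [
    # 1. make sure the mount point exists on every client
    pdsh_command(CLIENTS, f"mkdir -p {MOUNT_POINT}"),
    # 2. mount the file system on every client
    pdsh_command(CLIENTS, f"glusterfs -f {VOLFILE} {MOUNT_POINT}"),
]

for argv in steps:
    print(shlex.join(argv))   # shell-quoted form, ready to paste
```

To actually run the steps rather than print them, each list could be passed to subprocess.run(argv, check=True) on a machine where pdsh is installed.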

As I mentioned previously, there are a large number of translators available for GlusterFS. These translators give GlusterFS the ability to tailor the file system to achieve specific behavior. The list of translators is fairly long but deserves to be listed to show the strength of GlusterFS and perhaps more importantly, user space file systems.


  • Storage Translators: (define the behavior of the back-end storage for GlusterFS)

    • POSIX – Tells GlusterFS to use a normal POSIX file system as the backend (e.g. ext3)
    • BDB – Tells GlusterFS to use Berkeley DB as the backend storage mechanism. It uses key-value pairs to store data and uses POSIX directories to store directories.

  • Client and Server Translators: These translators “… export a translator tree over the network or access a remote GlusterFS server.”

    • Transport Modules:

      • TCP: Forces TCP to be used between client and server
      • IB-SDP: This translator forces GlusterFS to use a socket interface for InfiniBand hardware. SDP runs over ib-verbs.
      • ib-verbs: According to the GlusterFS documentation, “The ib-verbs transport accesses the InfiniBand hardware through the verbs API, which is the lowest level of software access possible and which gives the highest performance. On InfiniBand hardware, it is always best to use ib-verbs. Use ib-sdp only if you cannot get ib-verbs working for some reason.”

    • Client Protocol: The client protocol translator allows the GlusterFS client to access a server’s translator tree (stack).
    • Server Protocol: The server protocol translator exports the translator tree and makes it accessible to the GlusterFS clients.

  • Clustering Translators: These translators are used to give GlusterFS the ability to use multiple servers to create clustered storage. These translators are key to defining the basic behavior of a GlusterFS file system.

    • distribute: This translator aggregates storage from several storage servers.
    • unify: This translator takes all of the sub-volumes from the storage servers and makes them appear as a single volume (i.e. it unifies them). One key feature of this translator is that a particular file can exist on only one of the sub-volumes in the storage cluster. The unify translator also uses the concept of a “scheduler” to determine where a file resides:

      • ALU: This stands for “Adaptive Least Usage” and causes GlusterFS to balance the “load” across volumes. The load is defined by “sub-balancers”. These sub-balancers can be arranged in order of importance for load balancing to create quite sophisticated behavior.

        • disk-usage: Watches the free and used disk-space on the volume
        • read-usage: Watches the amount of reading done from this volume
        • write-usage: Watches the amount of writing done from this volume
        • open-files-usage: Watches the number of open files from this volume
        • disk-speed-usage: The speed at which disks are spinning. This is almost always a constant so it’s not very useful.

      • RR (Round Robin): Creates files in a round-robin fashion on the volumes. Each client has its own round-robin loop.
      • Random: Randomly selects a node for storing the file.
      • NUFA: The Non-Uniform File Allocation scheduler allows files to be created locally if the client is also a server.
      • Namespace Volume
      • Self Heal

    • Replicate: This translator replicates files and directories across the sub-volumes. If there are 3 subvolumes then a copy of each file/directory will be on each subvolume. Also, if a downed storage node returns to service it will be updated from the other nodes. This translator has additional features:

      • File self-heal: Defines the file self-healing characteristics
      • Directory self-heal: Defines the directory self-healing characteristics

    • Stripe: Distributes the contents of a file across subvolumes.

  • Performance Translators:

    • Read ahead: Caches read data before it is needed (pre-fetch). Typically this is data that appears next in the file.
    • Write Behind: Allows the write operation to return even if the operation hasn’t been completed (helps latency of write operations).
    • IO Threads: Performs file IO (read/write) in a background thread.
    • IO Cache: Caches data that has been read.
    • Booster: Allows applications to skip using FUSE and access the GlusterFS directly. This typically increases performance.

  • Features Translators:

    • POSIX Locks: This feature translator provides storage independent POSIX record locking support (i.e. fcntl locking).
    • Fixed ID: According to the GlusterFS guide, “The fixed ID translator makes all filesystem requests from the client to appear to be coming from a fixed, specified UID/GID, regardless of which user actually initiated the request.”

  • Misc Translators:

    • rot13: This is a translator that shows how to do encryption within GlusterFS using the simple rot-13 encryption scheme (if you can call rot-13 encryption).
    • trace: Used for debugging.
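
The scheduler idea from the unify translator is easy to picture in code: given a set of sub-volumes, a scheduler decides where each new file is created. The Python sketch below compares a round-robin scheduler with a “least disk usage” scheduler. It is a toy model, not GlusterFS code; the class names are hypothetical, and the second scheduler only stands in for the disk-usage sub-balancer, since the real ALU scheduler combines several sub-balancers.

```python
import itertools


class RoundRobinScheduler:
    """Cycle through the sub-volumes in order, one new file at a time."""

    def __init__(self, subvolumes):
        self._cycle = itertools.cycle(subvolumes)

    def pick(self, usage):
        return next(self._cycle)


class LeastDiskUsageScheduler:
    """Pick the sub-volume with the fewest bytes used so far."""

    def __init__(self, subvolumes):
        self._subvolumes = list(subvolumes)

    def pick(self, usage):
        return min(self._subvolumes, key=lambda name: usage[name])


def place_files(scheduler, subvolumes, files):
    """Create files one by one and return {file name: sub-volume}."""
    usage = {name: 0 for name in subvolumes}
    placement = {}
    for file_name, size in files:
        target = scheduler.pick(usage)
        usage[target] += size
        placement[file_name] = target
    return placement


BRICKS = ["brick1", "brick2", "brick3"]
FILES = [("a", 900), ("b", 100), ("c", 100), ("d", 100), ("e", 100)]

print(place_files(RoundRobinScheduler(BRICKS), BRICKS, FILES))
print(place_files(LeastDiskUsageScheduler(BRICKS), BRICKS, FILES))
```

With one large file followed by several small ones, round robin keeps returning to brick1 even though it is the fullest, while the usage-based scheduler steers the small files to the other two bricks.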


It’s pretty obvious that you can develop a storage system to fit almost any behavior. The User’s Guide has a much more extensive discussion of the translators. There are even examples for different types of behavior.
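
To see how three of the clustering translators differ, the following Python sketch shows where the bytes of one file would land under distribute, replicate, and stripe. It is a simplified model rather than GlusterFS’s implementation: the function names are hypothetical, the 4-byte stripe block is chosen only to keep the output readable, and picking the distribute target with a CRC32 hash of the file name is an assumption made for the example.

```python
import zlib


def distribute(name, data, subvolumes):
    """Whole file on one sub-volume (chosen here by hashing the name)."""
    index = zlib.crc32(name.encode()) % len(subvolumes)
    return {subvolumes[index]: data}


def replicate(name, data, subvolumes):
    """A full copy of the file on every sub-volume."""
    return {sv: data for sv in subvolumes}


def stripe(name, data, subvolumes, block_size=4):
    """Fixed-size blocks of the file dealt out across the sub-volumes."""
    layout = {sv: b"" for sv in subvolumes}
    for offset in range(0, len(data), block_size):
        target = subvolumes[(offset // block_size) % len(subvolumes)]
        layout[target] += data[offset:offset + block_size]
    return layout


BRICKS = ["brick1", "brick2", "brick3"]
DATA = b"0123456789AB"

for mode in (distribute, replicate, stripe):
    print(mode.__name__, mode("file.dat", DATA, BRICKS))
```

Distribute and replicate keep each file whole, so every brick still holds ordinary files in its local file system, whereas stripe leaves only a piece of the file on each brick.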

But the power of GlusterFS, its configurability through the numerous translators, can also make it difficult to set up. What’s the proper order for the translators? Which translators are better on the client and which ones are better on the server? What is the order of translators for the best performance or best reliability or best capacity utilization? In the next generation of GlusterFS, the developers have made installation and configuration a bit easier.
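
Why the order matters becomes clearer with a toy model of a translator stack. In the Python sketch below, each translator wraps a subvolume and forwards reads and writes to it, possibly changing the data on the way. The class names are hypothetical stand-ins (an in-memory store for the storage translator, plus simplified versions of the rot13 and trace ideas), not GlusterFS’s actual interfaces.

```python
import codecs


class MemoryStore:
    """Bottom of the stack: stands in for a storage translator."""

    def __init__(self):
        self.files = {}

    def write(self, path, text):
        self.files[path] = text

    def read(self, path):
        return self.files[path]


class Rot13:
    """Transforms data on the way down and again on the way back up."""

    def __init__(self, subvolume):
        self.subvolume = subvolume

    def write(self, path, text):
        self.subvolume.write(path, codecs.encode(text, "rot_13"))

    def read(self, path):
        # rot13 is its own inverse, so encoding twice restores the text
        return codecs.encode(self.subvolume.read(path), "rot_13")


class Trace:
    """Records every call before passing it to the subvolume."""

    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.log = []

    def write(self, path, text):
        self.log.append(("write", path))
        self.subvolume.write(path, text)

    def read(self, path):
        self.log.append(("read", path))
        return self.subvolume.read(path)


store = MemoryStore()
stack = Trace(Rot13(store))            # trace on top, storage at the bottom

stack.write("/hello.txt", "Hello, GlusterFS")
print(store.files["/hello.txt"])       # Uryyb, TyhfgreSF
print(stack.read("/hello.txt"))        # Hello, GlusterFS
print(stack.log)
```

Here the trace layer sits above rot13 and never sees the scrambled data; swap the two and it would sit below the transformation instead, which is the kind of ordering decision the questions above are about.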

GlusterFS From 2.0.8 to 3.0

This version of GlusterFS has the same basic system requirements as the earlier versions:


  • Typical Linux hardware (x86_64 servers) that have direct attached storage, SAN storage, or some combination.
  • A network connecting the servers and clients. The network can be GigE, 10GigE, or InfiniBand.

There are pre-built binaries for various Linux distributions listed here for GlusterFS itself (server, client, and common files). You can also build it from source if you like.

In this version, GlusterFS is still a meta-file-system, so it’s built on top of other local file systems such as ext3 and ext4. According to the GlusterFS website, xfs also works but has much poorer performance than the other file systems. Be sure to build the file systems on each server prior to configuring and starting GlusterFS.

Recall that after GlusterFS is installed on the servers and the clients the next step is to create the volume specification files on the server. Prior to version 2.0.8 we had to create these files by hand. While not difficult, it was time consuming and errors could easily have been introduced. Starting with version 2.0.8 and extending into version 3.x of GlusterFS, there is a new command, glusterfs-volgen that creates the volume specification file for you. A simple example from the Server Installation and Configuration Guide illustrates how to do this.

# glusterfs-volgen --name store1 hostname1:/export/sdb1 hostname2:/export/sdb1 \
hostname3:/export/sdb1 hostname4:/export/sdb1


The options are pretty simple: “--name” is the name of the volume (in this case “store1”). After that is a list of the hosts and their GlusterFS volumes that are used in the file system.

For this particular example, a total of five files are created by the glusterfs-volgen command.

hostname1-store1-export.vol
hostname2-store1-export.vol
hostname3-store1-export.vol
hostname4-store1-export.vol

store1-tcp.vol


The first four files are for the servers (you can pick out which file belongs to which server) and the fifth file is for the clients.
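
The naming pattern in that listing can be written down as a short Python sketch. This is inferred purely from the five file names shown above and has not been checked against glusterfs-volgen itself; the function name volgen_file_names is hypothetical, and treating “export” and “tcp” as fixed parts of the names is an assumption.

```python
def volgen_file_names(volume_name, bricks, transport="tcp"):
    """Guess the .vol file names for a volume, based only on the listing.

    Returns (list of server file names, client file name).
    """
    server_files = []
    for brick in bricks:
        host, _export_path = brick.split(":", 1)   # "hostname1:/export/sdb1"
        server_files.append(f"{host}-{volume_name}-export.vol")
    client_file = f"{volume_name}-{transport}.vol"
    return server_files, client_file


bricks = [f"hostname{n}:/export/sdb1" for n in range(1, 5)]
servers, client = volgen_file_names("store1", bricks)
for name in servers:
    print(name)
print(client)
```

A helper like this is handy when scripting the copy step, because it tells you which file goes to which host.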

This example creates a simple distributed volume (i.e. no striping or replication). You can create those volumes as well with some simple additional options to glusterfs-volgen.


  • Replicated Volume: # glusterfs-volgen --name repstore1 --raid 1
  • Striped Volume: # glusterfs-volgen --name stripe1 --raid 0

You can also tell it to use InfiniBand Verbs as well. The details are contained on the “glusterfs-volgen” reference page.

Once the volume specification files are created by “glusterfs-volgen”, you can copy them, using something like scp, to the appropriate servers. You will also need to copy the client file to all clients, or you can use a nifty new feature of GlusterFS that allows each client to pull the correct file from a server. The following command

# glusterfs --volfile-server=hostname1 /mnt/glusterfs


tells GlusterFS where to get the volume specification file and where to mount GlusterFS (just be sure the mount point exists on all clients before using this command). The command looks for the file /etc/glusterfs/glusterfs.vol on the server, so you can either copy the client file to that path on hostname1 or symlink the client file to it.

At this point we’ve configured the servers and we can start glusterfs on each one as we did before.

# glusterfsd -f /etc/glusterfs/glusterfsd.vol


where /etc/glusterfs/glusterfsd.vol is the volume specification created for the server. On Red Hat style systems you can also use the command,

# /etc/init.d/glusterfsd [start|stop]


which looks for the file, /etc/glusterfs/glusterfsd.vol on each server. Be sure this file is the correct one for each server.

The client portion of GlusterFS is just as easy as the server. You download the correct binary or you build it from source. The next step is to actually mount the volume you want on the client. You need a client volume specification file on each client before trying to mount GlusterFS. Previously it was mentioned that it’s possible to have the client pull the client volume specification file from a server. Alternatively you could just copy the client volume specification file from the server to every client using something like pdsh. Regardless, the .vol file needs to be on every client as /etc/glusterfs/glusterfs.vol.

You can mount glusterfs using the normal “mount” command with the glusterfs type option.

# mount -t glusterfs hostname1 /mnt/glusterfs


Or you can put the mount command in the /etc/fstab file just like any other file system.
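
An fstab line carries the same three pieces of information as the mount command (source, mount point, and type), followed by the mount options and the dump and fsck fields. The Python sketch below formats such a line for the example above; the function name fstab_entry is hypothetical, and “defaults” with “0 0” is simply the generic choice, not a GlusterFS recommendation.

```python
def fstab_entry(source, mount_point, fs_type="glusterfs", options="defaults"):
    """Format one /etc/fstab line: source, mount point, type, options,
    then the dump and fsck-order fields (both 0 here)."""
    return f"{source} {mount_point} {fs_type} {options} 0 0"


# Mirrors: mount -t glusterfs hostname1 /mnt/glusterfs
print(fstab_entry("hostname1", "/mnt/glusterfs"))
# hostname1 /mnt/glusterfs glusterfs defaults 0 0
```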

GlusterFS can also use Samba to re-export the file system to Windows clients. But one aspect that many people are not fond of is that GlusterFS requires that you use a user-space NFS server, unfs3, for re-exporting GlusterFS over NFS. You cannot use the kernel NFS server to re-export GlusterFS – you have to use unfs3. You can use any NFS client you wish, but on the server you have to use unfs3.

GlusterFS – The Model for Future File System Development?

For the last several articles I’ve been talking about user-space file systems. While good, stable, and useful file systems are notoriously difficult to write, it is perhaps more difficult to get a new file system into the kernel, for obvious reasons. FUSE allows you to write a file system in user space, which has all kinds of benefits – faster release of code, no kernel recompiles, and languages other than C can be used (I’m still waiting for the Fortran bindings). But is it worthwhile to write very extensive file systems in FUSE?

GlusterFS is an example of how much you can achieve by writing file systems using FUSE. It is likely the most configurable file system available, with many options to achieve the behavior you want. The concept of stackable translators allows you to tune the transport protocol, the IO schedulers (at least within GlusterFS), clustering options, etc., to achieve the behavior, performance, and capacity you want (or need). Even better, you could always write a translator to give you a specific feature for your application(s). I would bet big money that you could never get something like this into the kernel – and why would you?

Keeping the file system in user space allows developers to rapidly update code and get it in the hands of testers. If the file system were in the kernel, the pace of release would be much slower and the pace of testing could possibly be slower as well. Who wants to roll out a new kernel to test out a new file system version? Almost everyone is very conservative with their kernel, and rightfully so.

GlusterFS is a very cool file system for many reasons. It allows you to aggregate and use disparate storage resources in a variety of ways. It is in use at a number of sites for very large storage arrays, for high performance computing, and for specific application arrays such as bioinformatics that have particular IO patterns (particularly nasty IO patterns in many cases). Be sure to give it a try if it suits your needs.

But I can’t help but wonder: does GlusterFS represent the future of file system development for Linux?
