From Russia with Love: POHMELFS – A New Distributed Storage Solution

There is a new distributed file system called POHMELFS in the staging area of the 2.6.30 kernel. Sporting better performance than classic NFS, it's definitely worth a look.

Distributed storage solutions are almost ubiquitous today. They are used in HPC systems and on corporate desktops and laptops, and even home users are starting to use servers to provide centralized storage for their households. The most common file system used in these situations is NFS. It has been in use for many years, comes with virtually every OS, is well understood, and just works. In addition, it’s the only standard network file system, which allows you to use a single central server for many different operating systems (OSes).

However, NFS is not without its limitations. Evgeniy Polyakov, a longtime Linux hacker, has recently contributed a new distributed file system called POHMELFS (Parallel Optimized Host Message Exchange Layered File System). It has appeared in the staging area of the “chock-full-of-filesystems” kernel version 2.6.30. It is ready for testing and can give you a boost in performance (remember, it’s parallel!). This article discusses POHMELFS and where it is headed.

An Oldie but a Goody – NFS

NFS has been the dominant file system protocol for distributed storage needs because it is “there” and is pretty much “plug and play” on most *nix systems. It was the first widespread file system that allowed distributed systems to share data effectively. In fact, it’s the only standard network file system.

While NFS is likely the most ubiquitous distributed file system, it has gotten a little long in the tooth, so to speak, and has some limitations. For example, it doesn’t scale well to large numbers of clients and has limited performance. It also used to have some security issues, but these were addressed in Version 4 of the NFS protocol. Despite these limitations, it remains the most popular distributed file system because:


  • It comes with virtually every OS (it can even be an add-on to Windows)
  • Easy to configure and manage
  • Well understood
  • Works across multiple platforms
  • Usually requires little administration (until it goes down)
  • It just works

NFS is a fairly easy protocol to follow. All information, both data and metadata, flows through the file server. This is commonly referred to as an “in-band” data flow model, shown in Figure 1 below.

Figure 1 – In-Band Data Flow Model (Courtesy of Panasas)

Notice that the file server touches and manages all data and metadata. This model does make things a bit easier to configure and monitor. Moreover, it has narrow, well-defined failure modes. The drawbacks include an obvious performance bottleneck, difficulty with load balancing, and security that is a function of the server node rather than the protocol (which means that security features can be all over the map).

With NFS, at least one server “exports” some storage space to the nodes of the cluster. These nodes mount the exported file system(s). When a file request is made to one of these mounted file systems, the NFS client forwards the request to the NFS server, which then accesses the file on its local file system. The data is then transferred from the NFS server to the requesting node, typically over TCP, although UDP can also be used. Notice that NFS is “file” based. That is, when a data request is made, it is made on a file, not on blocks of data or a byte range. So we say that NFS is a file-based protocol.

For typical NFS systems (not clustered NFS), all metadata and data operations go through a single server. As the number of clients addressing the server increases, so does the load the server must carry. Consequently, NFS performance can be limited; how limited depends on the number of clients accessing the storage, the workload, the capabilities of the server, and the network.
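
To put a rough number on that bottleneck, here is a back-of-the-envelope sketch in Python. It assumes an idealized server whose bandwidth is shared evenly among active clients. The function name and the 125 MB/s figure (roughly one Gigabit Ethernet link) are illustrative assumptions, not measurements of any real NFS server.

```python
def per_client_bandwidth(server_mb_per_s: float, clients: int) -> float:
    """Upper bound on per-client throughput when every byte flows through one server.

    Assumes an idealized, evenly shared link. A real server also pays for
    metadata operations, disk seeks, and protocol overhead, so actual
    numbers will be lower.
    """
    if clients < 1:
        raise ValueError("need at least one client")
    return server_mb_per_s / clients


if __name__ == "__main__":
    # 125 MB/s is an assumed ceiling for a single Gigabit Ethernet link.
    for n in (1, 8, 64):
        share = per_client_bandwidth(125.0, n)
        print(f"{n:3d} clients -> at most {share:.2f} MB/s each")
```

The point of the arithmetic is simply that, in the in-band model, adding clients divides a fixed resource; adding servers is the only way to raise the ceiling.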

Introduction to POHMELFS

POHMELFS is a new file system designed to take a step beyond classic NFS, with a focus on improved performance. Its name even contains the term “Parallel,” indicating that clients interact with multiple servers. In particular, it can balance reads across multiple servers and perform simultaneous writes to different remote servers. It follows the classic client-server model. As with NFS, POHMELFS exports a directory from each server, so it relies on an underlying file system to read and write data on the physical devices themselves. In addition, it is designed as an object-based file system (more on that below).
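
The idea of mirrored writes and balanced reads can be illustrated with a small Python sketch. This is a toy in-memory model, not POHMELFS code: the `Server` and `Client` classes are hypothetical, reads are balanced with simple round-robin (the actual balancing policy is not described here), and writes go to the servers one after another rather than truly in parallel.

```python
import itertools


class Server:
    """Stand-in for one storage server exporting a directory (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects = {}   # path -> bytes
        self.reads = 0      # how many reads this server has answered

    def write(self, path: str, data: bytes) -> None:
        self.objects[path] = data

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self.objects[path]


class Client:
    """Mirrors every write to all servers and rotates reads among them."""

    def __init__(self, servers) -> None:
        if not servers:
            raise ValueError("need at least one server")
        self.servers = list(servers)
        self._next_reader = itertools.cycle(self.servers)

    def write(self, path: str, data: bytes) -> None:
        # Every server receives a copy, so any of them can serve later reads.
        for server in self.servers:
            server.write(path, data)

    def read(self, path: str) -> bytes:
        # Round-robin: consecutive reads land on different servers.
        return next(self._next_reader).read(path)


if __name__ == "__main__":
    servers = [Server("alpha"), Server("beta")]
    client = Client(servers)
    client.write("/export/file.txt", b"hello")
    for _ in range(4):
        client.read("/export/file.txt")
    print({s.name: s.reads for s in servers})
```

Because each server holds a full copy, read load spreads across the set instead of piling onto one machine, which is the contrast with the single-server NFS model described earlier.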

The POHMELFS website has a very comprehensive list of features. For the sake of completeness these features are summarized here:


  • One of the most important attributes is the ability to write to multiple servers and balance reading between multiple servers.

  • POHMELFS has a local coherent cache for data and metadata (this basically adds some of the features of FS-Cache and CacheFS to the network file system).

  • It includes locking (a necessary feature for a shared file system). It was originally designed for byte-range locking but according to the website, all Linux file systems lock the whole inode. So the developers decided to lock the whole object during writing. But POHMELFS has the ability to allow different clients to simultaneously write into the same page via different offsets with the result that the file will be coherent on all clients and all servers (not a small feat).

  • All events are completely asynchronous with the only exceptions being hard links and symlinks. These events include the creation of the objects as well as data reading and writing.

  • POHMELFS is designed to have a flexible object architecture that is optimized for network processing. Network processing is a potential weak point for distributed file systems, since file systems can be “chatty” and create many small messages that are not always optimal for networks. The design of the object architecture allows for very long paths to objects and the ability to remove directories of arbitrary size with a single network command.

  • The server portion is multi-threaded and scalable and, perhaps more importantly, is in user space. There is only a driver for POHMELFS in the kernel. The client and the server are all in user-space and interact with the driver. Assuming that the driver does not change very much, then as POHMELFS evolves only the user-space tools evolve. Consequently, new evolutions don’t require new kernels. This also means that development can progress at a very fast rate.

  • POHMELFS utilizes a transaction model for all its operations. Each transaction is an object which may embed multiple commands that are to be completed atomically. This design also means that it will resend transactions to different servers if there is a timeout or an error on the initially contacted server. This design maintains high data integrity and does not desynchronize the file system state in the event of a server failure or a network failure. An end result of this design is that if a server goes down the clients can switch to a different one automatically.

  • It has the ability for the clients to dynamically add or remove servers from a working set.

  • POHMELFS is also designed for strong authentication with the possibility of data encryption in the network channel.

  • It has extended attribute support.

  • It can do read-only mounts and can also limit the maximum size of the exported directory.
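
The transaction and failover behavior in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the POHMELFS wire protocol: `TxServer`, `ServerDown`, and `send_transaction` are hypothetical names, commands are plain tuples, and a server failure is simulated with a flag rather than a real timeout.

```python
class ServerDown(Exception):
    """Raised when a (hypothetical) server cannot complete a transaction."""


class TxServer:
    """Toy server that applies a batch of commands atomically."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.state = {}  # path -> bytes

    def apply(self, commands):
        """Apply every command in the transaction, or none of them."""
        if not self.healthy:
            raise ServerDown(self.name)
        staged = dict(self.state)  # work on a copy so a failure changes nothing
        for op, path, data in commands:
            if op == "write":
                staged[path] = data
            elif op == "remove":
                staged.pop(path, None)
            else:
                raise ValueError(f"unknown command: {op}")
        self.state = staged  # commit all commands at once


def send_transaction(working_set, commands):
    """Send one transaction, resending to the next server if one fails.

    Returns the server that accepted the transaction.
    """
    for server in working_set:
        try:
            server.apply(commands)
            return server
        except ServerDown:
            continue  # same transaction, next server in the working set
    raise ServerDown("no server in the working set accepted the transaction")


if __name__ == "__main__":
    working_set = [TxServer("s1", healthy=False), TxServer("s2")]
    tx = [("write", "/a", b"1"), ("write", "/b", b"2"), ("remove", "/a", None)]
    accepted_by = send_transaction(working_set, tx)
    print(accepted_by.name, accepted_by.state)
```

Here the working set is an ordinary list, so adding or removing a server while the client is running amounts to appending to or deleting from that list. Because each transaction is staged on a copy and committed in one step, a failed or rejected transaction leaves the server state exactly as it was.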

Comments on "From Russia with Love: POHMELFS – A New Distributed Storage Solution"

mark_w

That’s a relief; I was starting to think that there wouldn’t be a new filesystem this week.

laytonjb

LOL!!! Just wait – there are still a few file systems to go :)

Is there anything you want to hear about in the storage/file system?

Thanks!

Jeff

dwolsten

Sounds great, but I think it needs a name other than “POHMELFS”. Maybe if a couple of letters could be removed, perhaps “POMEFS” or “PELFS”. File systems need to have short, simple names: NFS, AFS, XFS, JFS, ext2, ext3, ext4, etc. File systems with long, unwieldy names don’t seem to catch on.

sergeyfd

POHMELFS name makes a very interesting sense in Russian :-) Pohmel – is a feeling you have next morning after a good party with a lot of alcohol :-)

sc00bs

a catchy abbreviated name – PMS maybe :-)

normalex

OMG! The name is so funny, LOL.
The English interpreted name is HANGOVERFS :)

jpaugh64

This new filesystem intrigues me; I’ll keep it in mind when I start building my own clusters.

I personally find the meaning of the acronym to be hilarious. I think it should remain. It makes for a bad acronym, but would be a great nickname.

I want to have some more details about the architecture technical of pohmelfs ; what is elliptic network ; can you describe it with a figure
