HPC Smackdown: What is the Best Clustered File System?

Distributed vs. parallel? Emerging vs. stable? Join the discussion.

File systems: now there is a can of worms. Everyone has their favorite, yet no file system seems to do it all. There is also confusion about distributed file systems (like NFS) vs. parallel file systems (like GPFS, Lustre, etc.). And what about the upcoming pNFS (parallel NFS) standard?

In case you need some background, check out our series on parallel file systems. (See part 1, part 2, part 3.)

So what is your opinion?

Comments on "HPC Smackdown: What is the Best Clustered File System?"


I completely agree, it is a can of worms: there is no single cluster file system today that solves all problems for every user. And no, this is not a rote fall-back to the HPC mantra “It Depends(TM)”, nor is this my favorite vs. your favorite. The greatest impediment to anointing a One Best Cluster Filesystem is the diverse and usually contradictory set of requirements placed upon them.

In my own experience, I have yet to find any single cluster file system that comes objectively close to “usually a good answer”. A couple of years ago, we considered a filesystem for a very large-scale cluster supporting a huge number of readers of a single extremely large file, with each reader producing a unique output file. We also had to run multiple independent instances of this workflow, so aggregate bandwidth and scalability were the primary concerns. After vetting the characteristics of the available solutions and running some highly focused benchmarks on the likely contenders, we chose a storage cluster along with its specialized filesystem.

In another case, a specialized filesystem was not acceptable: only NFS would do. We didn’t have a primary usage scenario to guide our selection process; we had to support a wide range of users running various HPC applications on a cluster used for a wide variety of work, including benchmarking. A saving grace was that none of the applications were bottlenecked by either point-to-point or aggregate storage bandwidth, and, as an HPC cluster, latency was not considered a key metric either. So a completely different solution was implemented, focused on cost-effectiveness, capacity, and flexibility.

Both of these very different implementations sit in one organization. Trying to use either as a “standard” solution for both workloads would have resulted in a failed implementation, wasted money, and a frustrated user community.

Back to the conflicting requirements, what are some of them?

User access scenarios. Is there a single MPI program using MPI-IO? Are there several such programs? Are there programs using conventional I/O? Are there a few very large files with many, many readers? What about written files? Is concurrent write access required? Multiple-reader, single-writer? Are there tens, hundreds, thousands, or even tens of thousands of files?

File access patterns and I/O performance. Do large streaming sequential reads and writes dominate? Perhaps random I/O? The former stresses uncached bandwidth, while the latter is more latency sensitive and is measured in IOPS (I/O operations per second) by the storage providers. Is point-to-point storage bandwidth important? Alternatively, is cluster-wide aggregate bandwidth the key measure? Are you doing gigabytes of I/O against terabytes of files, or terabytes of I/O against gigabytes of files? Hey, maybe you even read files backwards! Not a joke: IOzone has such a test, added to support at least one ISV application’s file access patterns.
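The bandwidth-vs.-IOPS distinction is easy to quantify with a back-of-the-envelope sketch. The numbers below are purely illustrative, not from any benchmark:

```python
# The same volume of data moved in different record sizes stresses storage very
# differently: large sequential records are a bandwidth problem, while small
# random records are an IOPS (latency) problem.

def ops_needed(total_bytes: int, record_bytes: int) -> int:
    """I/O operations required to move total_bytes in record_bytes chunks."""
    return -(-total_bytes // record_bytes)  # ceiling division

GIB = 1 << 30
streaming_ops = ops_needed(GIB, 1 << 20)  # 1 MiB records, streaming reads
random_ops = ops_needed(GIB, 4 << 10)     # 4 KiB records, random reads
print(streaming_ops, random_ops)  # → 1024 262144
```

At, say, 100 MB/s of streaming bandwidth the first case is limited by transfer time; at 10,000 IOPS the second case spends roughly 26 seconds on per-operation latency alone, which is why the two workloads favor very different file systems.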

Data access methods. Is a block device needed? Not likely, since we’re talking about clustered file systems, but what about the file system itself? Some systems support only NFS and CIFS; others support those plus offer their own, usually higher-performance, filesystem. Clearly, the latter could bring some OS/distro compatibility issues into consideration. Still others provide only their own file system, with NFS access available only by exporting the cluster FS from some client system.

Reliability. Do you require reliable data storage? Is the data so critical that loss must be prevented at any cost? If it is that important, you had better have multiple copies replicated in multiple locations! Most clustered file systems don’t provide this level of reliability, so you may well need to provide it through backup systems behind the cluster FS. Of course, it may well be better/faster/cheaper for you to regenerate the data than to retrieve it from backup.

The trade-off. In the best of all worlds, you get all qualities at the same time for the same dime. We don’t live there. For example, improving uncached bandwidth usually means multiple data paths from multiple physical devices. Such bandwidth improvements come at the expense of latency and possibly reliability. Latency can be reduced by caching or SSDs, but those often come with size limitations, leaving very large files in the lurch.

At the end of the day, we usually cannot afford to maximize all metrics at the same time. So we have to choose to spend our money on those features that are important to us, and then find the file systems that have made those metrics their primary focus. That’s the real can o’ worms: figuring out which cluster file system will best support your needs.


Well, since no one has jumped in yet…. :)

From my perspective there is not one best clustered (parallel) file system. Each one has its pluses and minuses. I can go through each of them and give you my opinion, but I’m sure not too many people would be interested in that :) But let me mention features that I think would make a good parallel file system.

- High-speed (natch) with a good MPI-IO capable library or API
- Very reliable with excellent data protection features. An example of this is Panasas’ PanFS that has the ability to scan storage media for sectors that have gone bad and restore the data that was in the bad sector (no one else can do this)
- Dynamic HSM (the ability to move data from one performance layer to another automatically based on defined criteria)
- User-space (Patching kernels or having modules for kernels is evil)
- Object oriented (this makes life easier on so many levels)
- Reasonably priced (I had to throw that one in, but notice that I didn’t say cheap: I want companies to remain in business to support and develop the software)
- No charge for client drivers
- Multi-cluster support (clusters of clusters)
- TCP and native IB support (not IPoIB)
- Easy to scale (perhaps scale performance and capacity independently)
- Easy to administer (I hate to name names, but Lustre is a nightmare in this department)
- NFS and CIFS support
- High-speed drivers for Linux and Windows
- Extensive monitoring tools (can’t have enough of these)

I’m sure there are other things as well; I just can’t think of them at the moment. :)

If you care to comment on these ideals, please do.




Belated disclaimer: I work for Intel.


While I’m here, I might as well make a few comments about pNFS. In my previous comment, I mentioned some ideals for a parallel file system, and pNFS has some of them. In particular, it will come as part of Linux at some point, so while it’s not really user space, it is open source, easy to configure, and requires no kernel patching or new kernel modules. It should also be able to mix various kinds of storage: file based (like NFS), block based (like SANs), and object based (Panasas and Lustre). Pretty cool stuff.

Many people have been waiting for pNFS or NFS v4.1 to be ratified or accepted or whatever it’s called, for a while. I was hoping it would become a standard in 2008, but now I’m hoping for 2009 (kind of like waiting for 10GigE to come down in cost – but I digress). There is very active development going on with the Linux pNFS server and client. If you look around you can also find some instructions on using GFS2 or other file systems as part of a pNFS file system.

pNFS, while not a true panacea, should give clusters a leg up. It should have good performance, it’s scalable, it will come with Linux kernels and distros with no patching, it works with various kinds of storage, and it’s a standard. Don’t underestimate this last point. There is only one standard file system in the world – NFS. That’s it. pNFS will be the only standard parallel file system in the world. That has great implications for cross-platform issues and cross-OS issues.
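As a practical note, a pNFS mount is just an NFSv4.1 mount, so there is nothing exotic to install once your kernel and server support it. Here is a sketch of an /etc/fstab entry, with a hypothetical server name and mount point; older stacks spell the option minorversion=1 on fstype nfs4 while newer ones use vers=4.1 on fstype nfs, so check your distro’s nfs-utils documentation:

```
# /etc/fstab sketch: an NFSv4.1 mount, which enables pNFS layouts when the
# server offers them and the client kernel has the layout drivers built in.
nfsserver:/export   /mnt/pnfs   nfs4   minorversion=1,_netdev   0   0
```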



dnl made some great points, and he/she didn’t even get into great detail. The fundamental problem is that parallel storage is not always easy. In fact, almost all of the time it’s not easy.

I think the fundamental answer to the question of which one is best (or better) is that it depends upon the details of the situation. How’s that for a definite answer :) So how do we determine which one is best for us? I’m glad you asked.

The first step is understanding your applications. As dnl pointed out, you need to understand all kinds of aspects of how your code performs IO:

- Using local storage (a single disk), how much of the wall clock time is spent doing IO? (I am utterly surprised at how many people have no clue beyond the generic answer: “we need lots of IO”)
- Does the application use MPI-IO? Do all MPI processes participate in the IO?
- Does rank 0 alone do the IO for all processes?
- How many files are read? written?
- What “chunk” size do you use for these read/write functions? Is it large (MBs) or small (bytes or a few KBs)?
- Are there any lseeks in between reads/writes?
- If there are lseeks, does the application skip backward in the file and then read forward? (FEM applications do this quite often).
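One low-tech way to answer the chunk-size and seek-pattern questions above, without reaching for strace, is to wrap the file object and record what the application does. IOLogger below is a hypothetical helper, not part of any library, and it covers only read() and seek() for brevity:

```python
# Wrap a file-like object to log read chunk sizes and backward seeks.
import io

class IOLogger:
    def __init__(self, f):
        self._f = f
        self.reads = []          # chunk size of every read, in bytes
        self.backward_seeks = 0  # seeks to an earlier offset (FEM-style access)
        self._pos = 0

    def read(self, n=-1):
        data = self._f.read(n)
        self.reads.append(len(data))
        self._pos += len(data)
        return data

    def seek(self, offset, whence=0):
        if whence == 0 and offset < self._pos:
            self.backward_seeks += 1
        self._pos = self._f.seek(offset, whence)
        return self._pos

log = IOLogger(io.BytesIO(b"A" * 1024))
log.read(512)
log.seek(0)   # skip backward, then read forward again
log.read(128)
print(log.reads, log.backward_seeks)  # → [512, 128] 1
```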

There are more aspects you need to consider as well. Ultimately, you need to be able to profile the IO performance of your application or get some glimpse into what it’s doing.

Armed with this information, you now know what kind of throughput and IOPS the application is likely to need. You also know how the application performs its IO and which factors are important (e.g., what access pattern it uses).

So how do you get this information? I’m glad you asked that question as well. I’ve been working on a code (it’s really a Perl script) that takes the output from “strace” and does a quick analysis of it. It reports how much wall clock time was spent doing IO, the number of IO commands, the throughput (MB/s) of each read/write command, a per-file summary of what each file was used for (reading and writing), and so on.
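For flavor, here is a minimal sketch of that kind of strace analysis. The author’s tool is a Perl script; this Python version, with its deliberately simplified regex, is my own rough approximation, not the actual analyzer:

```python
# Capture a trace first with something like:
#   strace -f -T -e trace=open,read,write,lseek,close -o app.strace ./my_app
# (-T appends the time spent inside each syscall as <seconds>.)
import re
from collections import defaultdict

LINE = re.compile(
    r'(?P<call>read|write)\((?P<fd>\d+),.*\)\s*=\s*(?P<ret>-?\d+)\s*<(?P<secs>[\d.]+)>'
)

def summarize(lines):
    """Sum bytes moved, wall time, and op counts per syscall from strace -T output."""
    totals = defaultdict(lambda: {"bytes": 0, "secs": 0.0, "ops": 0})
    for line in lines:
        m = LINE.search(line)
        if not m or int(m["ret"]) < 0:  # skip non-IO lines and failed calls
            continue
        t = totals[m["call"]]
        t["bytes"] += int(m["ret"])
        t["secs"] += float(m["secs"])
        t["ops"] += 1
    return dict(totals)

sample = [
    'read(3, "..."..., 1048576) = 1048576 <0.002000>',
    'write(4, "..."..., 4096)   = 4096 <0.000100>',
]
print(summarize(sample))
```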

I’m still working on other aspects of the script as well. For example, I want to have the script output a “simulation” of your code. That is, it’s a dummy code that just performs your IO functions with dummy data. Then you can test this simulator on all types of storage systems to get accurate performance results for your code without actually having to pass around your code or your data!
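A rough sketch of that simulator idea: replay a recorded list of operations with dummy data and time only the I/O. The (op, size) trace format here is invented for illustration, and note that reading data you just wrote mostly hits the OS page cache, so local timings will be optimistic:

```python
# Replay a dummy I/O trace against a target path and time it.
import os
import tempfile
import time

def replay(ops, path):
    """ops: list of ('read' | 'write', size_in_bytes). Returns elapsed seconds."""
    start = time.perf_counter()
    with open(path, "w+b") as f:
        for op, size in ops:
            if op == "write":
                f.write(b"\0" * size)
            else:
                f.seek(0)
                f.read(size)
    return time.perf_counter() - start

path = os.path.join(tempfile.mkdtemp(), "replay.dat")
elapsed = replay([("write", 1 << 20), ("read", 4096)], path)
print(os.path.getsize(path), elapsed >= 0.0)  # → 1048576 True
```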

It’s still a work in progress but if anyone is interested, here’s the link:


Let me know if you have any problems (I think my email is in the script). Just don’t try the analyzer on really huge strace outputs (GB sizes); you’ll run out of memory. (I’m working on a DB version of the analyzer that can handle any size file.)

