POSIX IO Must Die!

POSIX IO is becoming a serious impediment to IO performance and scaling. POSIX is one of the standards that made portable programs possible, and POSIX IO is the portion of the standard covering IO. But as the world of storage evolves, with rapidly growing capacity and rapidly growing performance, it is time for POSIX IO to evolve or die.

POSIX (Portable Operating System Interface for Unix) is a set of standards that defines the application programming interface (API) as well as some shell and utility interfaces. It was developed primarily for *nix operating systems, but any operating system can adopt the standards. POSIX IO (not an official name) is the portion of the standard that defines the IO interface for POSIX-compliant applications. Functions such as read(), write(), open(), close(), lseek(), fwrite(), fread(), and so on are defined, including their error behavior. But these definitions were first codified in 1988 – 22 years ago!

During this time storage has changed dramatically. We now have thousands of systems with performance in the TeraFLOPS range including some people who have clusters of this size in their homes (nudge, nudge, wink, wink) and there are PetaFLOPS systems at strategic sites today. These systems can have hundreds of thousands of processors with a large percentage perhaps performing IO. That means there is the potential for a great deal of IO happening at the same time, usually to a single shared file system, including the possibility of a large number of nodes all writing to the same file.

Sitting in the middle of this is POSIX, with an interface that has not appreciably changed in 22 years! For several years people have been asking for changes or relaxations of the POSIX standards to improve IO performance. The reasons for these requests are fairly simple – to improve the IO performance of applications. At the same time, a change or relaxation in POSIX IO could enable the development of new storage mechanisms that improve not only application performance but management, reliability, portability, and scalability.

POSIX

POSIX is one of the big reasons that, in the world of *nix, you can take a program from one POSIX-compliant operating system to another. Nor are you limited to *nix operating systems, since the POSIX standard is open to anyone.

The original POSIX standard was developed by IEEE and was labeled as “IEEE Std 1003.1-1988.” Prior to 1997 it had several sections including:


  • POSIX.1: Core Services (This includes IO port interface and control)
  • POSIX.1b: Real-time extensions (IEEE Std 1003.1b-1993)
  • POSIX.1c: Threads extensions (IEEE Std 1003.1c-1995)
  • POSIX.2: Shell and Utilities (IEEE Std 1003.2-1992)

Since 1997 the Austin Group, a joint working group of the IEEE, The Open Group, and ISO/IEC, has been responsible for the reorganization of the POSIX standard as well as revisions to it. Consequently, POSIX is an international standard.

While the title of this article says “POSIX IO Must Die,” POSIX itself is a very important standard. It defines much of the general behavior we have come to know in our Linux systems. It also allows us to take programs written for Linux and run them on AIX, HP-UX, BSD, and even Mac OS X (this assumes that all dependent libraries are available on those systems, but that’s outside the scope of POSIX). It allows the world to write standard libraries that use POSIX interfaces and make them available for applications. Without POSIX, writing applications would be much more difficult.

Those reading this article who are a bit younger probably don’t remember the days when there was no real standard and writing programs for different operating systems was a very difficult process. As a famous person once described it, “… cats and dogs living together! Mass hysteria!…” Taking a program written on a VAX with VMS and then running it on a Unix system (a sane decision if you ask me) was problematic because of the lack of common interface standards. I remember writing an application on a VAX system in graduate school and then running it on a larger *nix-based system because it was faster. I spent a great deal of time bugging the system support staff about porting simple routines because of the lack of POSIX compatibility between the two operating systems.

At the same time, the POSIX standard, while evolving, is 22 years old! (I love the rule of three.) POSIX has become this extremely large cruise ship that people love to travel on because the food and the entertainment are always consistent and well defined. However, if the food or the entertainment on the cruise ship aren’t to your liking, or are preventing you from really having a good time, then it can seem quite limiting. This is exactly the case with POSIX IO for applications and organizations wanting high-performance IO.

Changes for Better Performance

As systems started to scale to large numbers of processors and larger problems were tackled, it was soon realized that storage systems were becoming bottlenecks. However, the problem didn’t necessarily lie with the file system but with the standards for interfacing the applications with the storage. This was particularly noticed for applications where there were many “writers” to a common shared file system.

A few years ago, a sub-group of the Open Group was created called the High End Computing Extensions Working Group (HECEWG). The goal of this group was to create a set of extensions or relaxations to POSIX to give applications better IO performance, including better scaling. The business case for this is presented in “A Business Case for Extensions to the POSIX I/O API for High End, Clustered, and Highly Concurrent Computing”. The group came up with a few proposals for changes to the Open Group that can be summarized as:


  • Allowing changes to the stat() function to dramatically improve performance when discovering information about the files in a file system
  • Opening a large number of files using a shared file system
  • Opening a single file from a large number of nodes on a shared file system
  • Creating a list of IO functions that you can send to the file system for fulfillment (reduces the number of individual IO operations)

Another document, co-authored by several members of the HECEWG, gives a longer list of changes and efforts surrounding the need for improved IO performance. From the document, “Relaxation of POSIX Semantics for Parallelism”:


  • “Scalable metadata operations in a single directory”
  • “NFSv4 security and the pNFS effort to allow NFS to get more native file system performance through separation of data and control which enables parallelism”
  • “I/O Middleware enhancements to enable dealing with small, overlapped, and unaligned I/O”
  • “Tighter integration between high level I/O libraries and I/O middleware”

These proposed changes are very important even if you are not an HPC user. Let’s look at the first item to explain why.

Metadata Operations

The first proposed change, scalable metadata operations in a single directory, affects a very large number of people, not just HPC. As an experiment, run the following command on your system in the root directory (“/”).

% time find . -type f | wc -l


This command will count all the files from the current directory (“.”) on down the tree. If you do this from the root you will get a count of all the files on your system. For my system, the result was the following:

606914

real    1m43.424s
user    0m0.796s
sys     0m3.024s


So it took almost two minutes to count all the files, and there were 606,914 files on my home system. Just a few years ago this would have been perhaps 100,000 or so. Now imagine a single file system having to keep track of over half a million files without making a mistake or having any corrupt data. And this is just for a desktop.

In the HPC world there are applications that can produce millions of files in a single directory, per node. Moreover, there are file systems with well over 1 PB (Petabyte) of data and hundreds of applications running at the same time, all producing data to a single shared file system. In the middle of this ballet a user runs the command “ls -lsa” to see if the file that his application is writing is changing size. For this command, the file system has to read the entire directory, then fetch the metadata for each of the files, several of which may be being read or written at that very moment. Then the results are formatted and presented to the user. Performing all of these operations can take a great deal of time, consume a great deal of CPU, and put the file system under considerable stress. And while these metadata operations are happening, the storage system still has to deliver high levels of throughput and IOPS to the running applications.

Comments on "POSIX IO Must Die!"

greimer

Maybe you’re trying to avoid getting too technical in this article, but you haven’t indicated a single good reason for any part of the existing POSIX IO spec to be deleted, or changed, or otherwise ‘killed’. It sounds like there just need to be extensions.

Exactly, precisely, what features of the existing spec force a performance bottleneck when they are implemented? Do they force a bottleneck in the sense that their implementation makes it *impossible* to meet I/O performance goals by any means (POSIX-compliant or not), or is it simply that their implementation does not take advantage of the I/O capabilities of the new hardware and systems?

maddognh

Perhaps the title of your article was to drive attention to your issue, but I basically agree with Greimer.

Secondly, from my observations the POSIX specifications almost always followed the “need” that was defined and felt. For the most part they did not look for new problems to solve and solve them; they documented how to solve existing problems in a standard way that people and companies could implement.

As to your user expecting, with a directory filled with millions of files and potentially hundreds of directories of metadata, that “ls -lsa” will return the information on the size of a particular file “instantly”: we used to call them “bad user on device”. I suggest giving this user a good class on shell commands and shell scripting. You might find the load on your systems drops dramatically.

maddog

greggwon

The problem is file system implementations and disk hardware performance. The I/O interfaces work fairly well. The kernel could schedule I/O ops better and filesystems can solve a lot of the problems with performance.

For example, if Linux licensing wasn’t under the GPL, then OpenSolaris ZFS could be ported to Linux and the world would change dramatically.

Then you’d be able to stack caching SSDs onto pools and do other things to better manage how often you used the extremely slow disk access bus.

I’ve always preferred the extreme flexibility of the VMS event-based I/O mechanisms to the function-based Unix versions that were codified into POSIX. POSIX makes it possible for very minimal OSes to support portability. VMS had great I/O capabilities and really excelled at filling that spot for quite some time. Unfortunately Digital did not address all the other issues that were important to users for overall portability, cost, and ease of use of the OS, and so it got less interesting fast as UNIX accelerated out of the gates.

lescoke

As a systems-level programmer, I have always seen the POSIX standards as defining a consistent programming interface for applications to use. A good example of a POSIX feature is that files, pipes, and network sockets all allow the use of select/poll to block until the I/O system is ready or has more data for processing.

I was not aware that POSIX dictates how this consistent API is implemented in the kernel. The performance issues mentioned in the article seem to be more of a file-system or device driver interface limitation.

I agree with greggwon on VMS’ very flexible event-flag I/O completion capabilities. I’m also fond of the event and other wait-object capabilities in Windows. I would like to see something similar added with the same level of portability I have come to love about POSIX compatibility.

The designers of Windows seemed to put the same kind of thought into the file-system, process, and thread handles and their underlying event-object capabilities as was done with POSIX file handles, but the implementation of Winsock and some later I/O objects failed to continue a similar consistency.

rpnabar

1. Out of curiosity what is the HPC app that generates millions of files in a single directory?

2. Extension seems a better way out than trashing POSIX totally

3. I remember reading a book on the Unix philosophy once. Vaguely, one of the principles there was along the lines of “Avoid gaining performance at the cost of compatibility”. I think that might be relevant here. If we killed POSIX, what would we do with years of legacy apps that might depend on POSIX features?

rarecactus

There are already distributed filesystems out there that aren’t constrained by POSIX.

PVFS is quite up-front and proud about the fact that it relaxes POSIX constraints where they interfere with performance. The granddaddy of all network filesystems, NFS, is *very* non-POSIX, with its multiple layers of unpredictable caching and crude hacks to work around statelessness. The only distributed filesystem that I can think of that *is* POSIX is Lustre.

There’s also a boatload of distributed key-value stores out there. Some of them guarantee very strict semantics; others guarantee almost nothing.

I guess what I’m trying to say is: rather than complaining that your sports car can’t seat your 3 kids, how about you buy a different car?

If your alternate choice gets popular enough, it will become standardized eventually. And there will be meetings. And donuts.

laytonjb

I thought I would post some clarifications for people asking for details.

I did not include any details in the article about specific use cases because there are articles and other information explaining them in greater detail. Please use Google to look for information around this issue by searching for “Rob Ross”, “Gary Grider”, “Garth Gibson”, “Henry Newman”, and others. They have far more detailed information and I would just be parroting it back here.

But let me continue a little. The POSIX IO specifications don’t just define the semantics of something such as open(); they have a big impact on the implementation of the function. I’m not enough of a kernel person (actually I’m not one at all) to be able to explain this. Rob Ross from Argonne can explain this far better than I can (so can Gary Grider).

I agree with Mad Dog about POSIX being something of a follower. It helps solve problems rather than being forward-looking. However, in the case of HPC applications, the case was made for extensions to the standard, and while I don’t know the whole story, the result is that there are no extensions and the subject has been dropped for 4 years.

In the meantime hardware is progressing rapidly and the HPC community is spending time and effort doing one of two things: (1) working within the POSIX standard and accepting its limitations, or (2) trying something new. As rarecactus points out, PVFS2, led primarily by Rob Ross, is doing (2) and going around POSIX to improve performance. As pointed out in my article, there is a danger in this in that applications may no longer be portable. It’s up to you to decide if this is good or bad – personally I’m willing to accept some deviation, but I would much rather have POSIX extensions/additions so that we can keep a single standard. As rarecactus also points out, it’s possible that these “new” solutions can become part of the standard (as Mad Dog notes), or a completely new standard could evolve from them. Is this good or bad?

On to the more fun topic about crazy users and applications :)

I can’t name the institution or the application that writes millions of files to a single directory. I will say that it’s a university where I learned about this, but I don’t want to say which one, and I don’t know the name of the application. However, the application is in the bio or chemistry world. The staff at the university are trying to help the application developer, but it’s difficult to convince a researcher to invest time into a code when they just want to get an answer. There was a blog about this fairly recently in the UK (don’t have the link handy) that explains it quite well. Researchers or users typically don’t want to spend time working on an application – they just want the answer. They assume that you can just throw hardware at the problem if the performance isn’t there. What’s even worse is that these applications that start out as one-offs sometimes take on a life of their own and become non-one-offs :) Then life gets interesting.

As for the “ls -lsa” issue… that one is trickier. As Mad Dog mentions, it’s worth teaching the users how to write better scripts, but many times this isn’t possible. By changing the script you are changing the process, and that means lots of $$ to revalidate the process. Users and companies are very reluctant to do this (this isn’t my opinion, BTW; this is what customers tell me). Customers then view the issues surrounding this as a bug in the product and not a bug in their process (“Doctor, doctor! It hurts when I do this!!”).

If you take away one thing from the blog, take away the fact that storage doesn’t get smaller or slower, and we as a community need to be more proactive about solving the problems that come with that. HPC is the proverbial canary in the coal mine. What they see will affect more users in a few years. So I think we need to prod POSIX to think a bit more proactively rather than reactively (sorry Mad Dog).

BTW – love the comment “bad user on device”. That is priceless and so descriptive.

Thanks for the comments everyone.

Jeff

strombrg

The article and the article’s title seem at odds with one another, the latter being rather sensationalistic. I expect more from Linux Magazine, but oh well.

Anyway, there have been some good responses to the article, so I have only one thing to add: POSIX I/O can be pretty fast in some applications if you use it with a cache-oblivious algorithm. Below is a URL to an example in Python. It’s just a simple GUI pipemeter written on top of POSIX I/O, but I didn’t want people to complain that it was a bottleneck (to some, “GUI” and “Python” and “POSIX I/O” mean “slow”, though I believe this shows that it need not, at least not in all cases):
http://stromberg.dnsalias.org/~dstromberg/gprog/

james.oden

I am unimpressed with this article. It makes emotionally charged statements about the oldness of the POSIX I/O standards, but gives no true reason that these actual interfaces are impeding progress. With most APIs, performance changes can occur under the hood without changing the API. In some cases this is not true. It may not be true of the POSIX I/O APIs, but this article has given nothing to show that that is the case. “Old” is not a good enough reason (frankly, it’s not even a reason).


