POSIX IO Must Die!

POSIX IO is becoming a serious impediment to IO performance and scaling. POSIX is one of the standards that enabled portable programs, and POSIX IO is the portion of the standard that covers IO. But as the world of storage evolves, with greatly increasing capacities and greatly increasing performance, it is time for POSIX IO to evolve or die.

All of these demands can put a HUGE burden on the file system, with the most probable result that throughput will be greatly reduced while the metadata is queried (don't forget, this is called "high" performance computing, not "mediocre" performance computing, so a reduction in throughput is not acceptable).

Imagine a user who runs this command several times a minute. The process is repeated every time because the user wants to know the size of a particular file (or files). Multiply this by several users, and you can see that the metadata load can become enormous. If you think users don't do this, you would be very surprised to learn that in reality hundreds if not thousands of HPC users do precisely this on a regular basis. In addition, there are application scripts that perform these checks as well.
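To make that load concrete, here is a minimal sketch (mine, not from the original article) of what such a habit looks like at the system-call level; the file name and polling interval are invented for illustration. Every size check is a stat() request that the file system must answer, and on a parallel file system that request typically lands on the metadata service.

/* Hypothetical sketch of the polling habit described above: each
 * iteration is a metadata request that the file system must service.
 * The file name and interval are made up for illustration. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "results.dat";   /* assumed output file */
    struct stat sb;

    for (;;) {
        if (stat(path, &sb) == 0)
            printf("%s is %lld bytes\n", path, (long long)sb.st_size);
        else
            perror("stat");
        sleep(10);   /* "several times a minute" */
    }
    return 0;
}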

In my previous career at an HPC storage manufacturer, there was a particular customer who became irate beyond all rationality because the results of an "ls -lsa" command did not return the exact size of the file the instant the command was executed. They had scripts that depended upon this command to launch applications, and without a precise number at any given instant their process stalled. At the same time, they could not tolerate any reduction in throughput in exchange for that metadata accuracy.

Almost all reasonable users can accept that it can take time to return the results of a command, but they want accurate information about the state of their files. There are also users, though, who want the command to return immediately and who don't want much (any) reduction in throughput. The compromise discussed in the HECEWG was to offer two options to users. The first option is to give back accurate file information without much reduction in throughput, at the expense of taking more time. The second option is a sort of "lazy" metadata update: the metadata is updated and stored in a cache when the system load "pauses" or drops a bit. When a user asks for file information with this option, the cache is quickly queried and the result is returned. It's not the most accurate information, since it lags the true state by some amount of time, but it has ZERO impact on storage performance.
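As a rough illustration of that second option, here is a small user-space sketch of a "lazy" stat with a freshness window. The cache structure, function name, and 30-second window are all assumptions made for this example; in practice the caching would live inside the file system or its client library, not in application code.

/* Rough user-space sketch of the "lazy" option: serve slightly stale
 * metadata from a cache instead of hitting the file system on every
 * query. All names and the freshness window are invented here. */
#include <string.h>
#include <time.h>
#include <sys/stat.h>

#define MAX_AGE_SECONDS 30   /* how stale we are willing to be */

struct cached_stat {
    char        path[4096];
    struct stat sb;
    time_t      fetched;     /* when the real stat() was last done */
};

/* Return cached metadata if it is fresh enough, otherwise refresh it. */
int lazy_stat(struct cached_stat *c, const char *path, struct stat *out)
{
    time_t now = time(NULL);

    if (strcmp(c->path, path) != 0 || now - c->fetched > MAX_AGE_SECONDS) {
        if (stat(path, &c->sb) != 0)
            return -1;                       /* real query failed */
        strncpy(c->path, path, sizeof(c->path) - 1);
        c->path[sizeof(c->path) - 1] = '\0';
        c->fetched = now;
    }
    *out = c->sb;                            /* possibly stale, but cheap */
    return 0;
}

A caller would zero-initialize the struct and call lazy_stat() wherever it would otherwise call stat(); answers can lag reality by up to the freshness window, but most queries never touch the storage system.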

If you think this scenario of metadata impacting performance is not likely to happen to you, I suggest you just wait a couple of years, or try a small, slow drive in your quad-core system while running several applications. We now have desktops with 4 cores, and 6 cores are coming quickly. If all of the cores are used by applications, and a good percentage of them are performing IO, then you will certainly see some serious stress on your file system. This stress includes both metadata operations and throughput. Not everyone runs a database on their desktop, but almost everyone runs a web browser. Take a look at the number and size of the cached files for your browser and I think you will be shocked. The same is true for games. The stress these applications impose on file systems is something not dreamed of several years ago.

So What Happened?

As you are probably aware, the proposed extensions or relaxations have not been approved by the Austin Group. The last update to the HECEWG was four years ago (2006). Why weren't they accepted? While I don't know the answer to that question, my supposition is that there was not enough of an appetite among enough companies and users to justify the changes. Without companies willing to make changes, and without enough demand from users, the standard simply does not change. And why should it? The mighty cruise ship is traveling steadily and people are having a good time. However, history tells us that the HPC world, while smaller than the enterprise world, usually sees problems several years before the rest of the computing world does. Has the HECEWG shown that people need more than the POSIX cruise ship can provide, to the point where they will be forced to jump off? Given the advances in systems, I think it has.

What Next?

While the cruise ship analogy is useful in making a point, the computing world is not coming to an end because the extensions/relaxations were not passed. We are still growing systems beyond 1 PetaFLOPS, storage capacity and throughput are almost doubling every year, and regulatory requirements are forcing us to store all data for extended periods of time.

The HPC world is working very hard to keep up with demand. Shared storage systems are growing in capacity at an ever-increasing rate, and throughput requirements are rising as fast as ever. Couple that with the large number of cores in systems all trying to write to the same shared storage, and you can easily see the bottleneck developing.

Ideally, POSIX should evolve to accommodate the situations in the HPC world, anticipating that the same problems are coming to the enterprise world. However, this hasn't happened in four years and shows no signs of happening in the near future.

It may come to the point where new storage interfaces are developed that don't even use POSIX because the problems have become so severe. There are signs of this happening already. Basically, POSIX hasn't evolved, so people are starting to just go around it. This means that the standard is no longer useful and has become an impediment to progress.

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry's enjoying the coffee and waiting for sales (but never during working hours).

Comments on "POSIX IO Must Die!"

greimer

Maybe you're trying to avoid getting too technical in this article, but you haven't indicated a single good reason for any part of the existing POSIX IO spec to be deleted, or changed, or otherwise 'killed'. It sounds like there just need to be extensions.

Exactly, precisely, what features of the existing spec force a performance bottleneck when they are implemented? Do they force a performance bottleneck in the sense that their implementation makes it *impossible* to meet I/O performance goals by any means (POSIX-compliant or not), or is it the case that their implementation simply does not take advantage of the I/O capabilities of the new hardware and systems?

maddognh

Perhaps the title of your article was meant to drive attention to your issue, but I basically agree with Greimer.

Secondly, from my observations the POSIX specifications almost always followed the "need" that was defined and felt. For the most part they did not go looking for new problems to solve; rather, they documented how to solve existing problems in a standard way that people and companies could implement.

As to your user expecting, with a directory filled with millions of files and potentially hundreds of directories of metadata, that "ls -lsa" will return the information on the size of a particular file "instantly": we used to call them "bad user on device". I suggest giving this user a good class on shell commands and shell scripting. You might find the load on your systems drops dramatically.

maddog

greggwon

The problem is file system implementations and disk hardware performance. The I/O interfaces work fairly well. The kernel could schedule I/O ops better and filesystems can solve a lot of the problems with performance.

For example, if Linux weren't licensed under the GPL, then OpenSolaris ZFS could be ported to Linux and the world would change dramatically.

Then you'd be able to stack caching SSDs onto pools and do other things to better manage how often you used the extremely slow disk access bus.

I've always preferred the extreme flexibility of the VMS event-based I/O mechanisms to the function-based Unix versions that were codified into POSIX. POSIX makes it possible for very minimal OSes to support portability. VMS had great I/O capabilities and really excelled at filling that spot for quite some time. Unfortunately, Digital did not address all the other issues that were important to users for overall portability, cost, and ease of use of the OS, and so it got less interesting fast as UNIX accelerated out of the gates.

lescoke

As a systems-level programmer, I have always seen the POSIX standards as defining a consistent programming interface for applications to use. A good example of a POSIX feature is that files, pipes, and network sockets allow the use of select/poll to block until the I/O system is ready or has more data for processing.
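[Editor's note: for readers who have not used that interface, a minimal sketch of the uniformity being described might look like the following; the same poll() call can wait on a pipe, a socket, or a terminal descriptor (poll() on a regular local file simply reports it as always ready). This example is illustrative only, not from the comment author.]

/* Block (up to 5 seconds) until standard input has data to read.
 * The same call works on pipes and sockets. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };

    int ready = poll(&pfd, 1, 5000);
    if (ready > 0 && (pfd.revents & POLLIN))
        printf("stdin is ready for reading\n");
    else if (ready == 0)
        printf("timed out with nothing to read\n");
    else
        perror("poll");
    return 0;
}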

I was not aware that POSIX dictates how this consistent API is implemented in the kernel. The performance issues mentioned in the article seem to be more of a file-system or device driver interface limitation.

I agree with greggwon on VMS' very flexible event flag I/O completion capabilities. I'm also fond of the event and other wait-object capabilities in Windows. I would like to see something similar added with the same level of portability I have come to love about POSIX compatibility.

The designers of Windows seemed to put the same kind of thought into the file-system, process, and thread handles and their underlying event-object capabilities as was done with POSIX file handles, but the implementation of Winsock and some later I/O objects failed to maintain a similar consistency.

rpnabar

1. Out of curiosity, what is the HPC app that generates millions of files in a single directory?

2. Extensions seem a better way out than trashing POSIX entirely.

3. I remember reading a book on the Unix philosophy once. One of the principles there was, roughly, along the lines of "Avoid gaining performance at the cost of compatibility." I think that might be relevant here. If we killed POSIX, what would we do with the years of legacy apps that might depend on POSIX features?

rarecactus

There are already distributed filesystems out there that aren't constrained by POSIX.

PVFS is quite up-front and proud about the fact that it relaxes POSIX constraints where they interfere with performance. The granddaddy of all network filesystems, NFS, is *very* non-POSIX, with its multiple layers of unpredictable caching and crude hacks to work around statelessness. The only distributed filesystem that I can think of that *is* POSIX is Lustre.

There's also a boatload of distributed key-value stores out there. Some of them guarantee very strict semantics; others guarantee almost nothing.

I guess what I'm trying to say is: rather than complaining that your sports car can't seat your 3 kids, how about you buy a different car?

If your alternate choice gets popular enough, it will become standardized eventually. And there will be meetings. And donuts.

laytonjb

I thought I would post some clarifications for people asking for details.

I did not include any details in the article about specific use cases because there are articles and other information explaining them in greater detail. Please use Google to look for information around this issue by searching for "Rob Ross", "Gary Grider", "Garth Gibson", "Henry Newman", and others. They have far more detailed information and I would just be parroting it back here.

But let me continue a little. The POSIX IO specifications don't just define the semantics of something such as open(); they have a big impact on the implementation of the function. I'm not enough of a kernel person (actually I'm not one at all) to be able to explain this. Rob Ross from Argonne can explain this far better than I can (so can Gary Grider).

I agree with Mad Dog about POSIX being something of a follower. It helps solve problems rather than being forward-looking. However, in the case of HPC applications, the case was made for extensions to the standard, and while I don't know the whole story, the result is that there are no extensions and the subject has been dropped for 4 years.

In the meantime, hardware is progressing rapidly and the HPC community is spending time and effort doing one of two things: (1) figuring out how to work within the POSIX standard and accept its limitations, or (2) trying something new. As rarecactus points out, PVFS2, led primarily by Rob Ross, is doing (2) and going around POSIX to improve performance. As pointed out in my article, there is a danger in this, in that applications may no longer be portable. It's up to you to decide if this is good or bad – personally I'm willing to accept some deviation, but I would much rather have POSIX extensions/additions so that we can keep a single standard. As rarecactus points out, it's possible that these "new" solutions can become part of the standard (as Mad Dog also notes), or that a completely new standard could evolve from them. Is this good or bad?

On to the more fun topic about crazy users and applications :)

I can't name the institution or the application that writes millions of files to a single directory. I will say that I learned about this at a university, but I don't want to say which one, and I don't know the name of the application. However, the application is in the bio world or the chemistry world. The staff at the university are trying to help the application developer, but it's difficult to convince a researcher to invest time into a code when they just want to get an answer. There was a blog post about this fairly recently in the UK (I don't have the link handy) that explains it quite well. Researchers and users typically don't want to spend time working on an application – they just want the answer. They assume that you can just throw hardware at the problem if the performance isn't there. What's even worse is that these applications, which are sometimes just one-offs, take on a life of their own and become non-one-offs :) Then life gets interesting.

As for the "ls -lsa" issue... that one is trickier. As Mad Dog mentions, it's worth teaching users how to write better scripts, but many times this isn't possible. By changing the script you are changing the process, and that means lots of $$ to revalidate the process. Users and companies are very reluctant to do this (this isn't my opinion, BTW; it's what customers tell me). Customers then view the issues surrounding this as a bug in the product and not a bug in their process ("Doctor, doctor! It hurts when I do this!!").

If you take away one thing from this blog, take away the fact that storage doesn't get smaller or slower, and we as a community need to be more proactive about solving the problems that come with that growth. HPC is the proverbial canary in the coal mine. What HPC sees now will affect many more users in a few years. So I think we need to prod POSIX to be a bit more proactive rather than reactive (sorry Mad Dog).

BTW – love the comment "bad user on device". That is priceless and so descriptive.

Thanks for the comments everyone.

Jeff

strombrg

The article and the article's title seem at odds with one another, the latter being rather sensationalistic. I expect more from Linux Magazine, but oh well.

Anyway, there have been some good responses to the article, so I have only one thing to add: POSIX I/O can be pretty fast in some applications if you use it with a cache-oblivious algorithm. Below is a URL to an example in Python. It's just a simple GUI pipemeter written on top of POSIX I/O, but I didn't want people to complain that it was a bottleneck (to some, "GUI" and "Python" and "POSIX I/O" mean "slow", though I believe this shows that it need not, at least not in all cases):
http://stromberg.dnsalias.org/~dstromberg/gprog/

james.oden

I am unimpressed with this article. It makes emotionally charged statements about the oldness of the POSIX I/O standards, but gives no true reason that these actual interfaces are impeding progress. With most APIs, performance changes can occur under the hood without changing the API. In some cases this is not true. It may not be true of the POSIX I/O APIs, but this article has given nothing to show that that is the case. "Old" is not a good enough reason (frankly, it's not even a reason).

