POSIX IO is becoming a serious impediment to IO performance and scaling. POSIX is one of the standards that enabled portable programs, and POSIX IO is the portion of the standard covering IO. But as the world of storage evolves, with rapidly increasing capacities and performance, it is time for POSIX IO to evolve or die.
All of these demands can put a HUGE burden on the file system, with the most probable result that throughput will be greatly reduced while the metadata is queried (don’t forget, this is called “high” performance computing, not “mediocre” performance computing, so a reduction in throughput is not acceptable).
Imagine a user who runs this command several times a minute. The process is repeated every time because the user wants to know the size of a particular file (or files). Multiply this by several users, and you can see that the metadata load can become enormous. If you think users don’t do this, you would be surprised to learn that in reality there are hundreds, if not thousands, of HPC users who do precisely this on a regular basis. In addition, there are application scripts that perform these checks as well.
In my previous career at an HPC storage manufacturer, there was a particular customer who became irate beyond all rationality because an “ls -lsa” command did not return the exact size of the file the instant the command was executed. They had scripts that depended upon this command to launch applications, and without a precise number at any given instant their process stalled. At the same time, they could not tolerate any loss of throughput while the metadata was being queried.
Almost all reasonable users can accept that it takes time to return the results of a command, as long as they get accurate information about the state of their files. But there are also users who want the command to return immediately and who also don’t want much (any) reduction in throughput. The compromise discussed in the HECEWG was to offer users two options. The first option returns accurate file information without much reduction in throughput, at the expense of taking more time. The second option is sort of a “lazy” metadata update. In this option, the metadata is updated and stored in a cache when the system load “pauses” or drops a bit. When a user then requests file information with this option, the cache is quickly queried and the result is returned. It’s not the most accurate information, since it lags the true state by some amount of time, but it has ZERO impact on storage performance.
If you think this scenario of metadata impacting performance is not likely to happen to you, I suggest you just wait a couple of years, or try a small, slow drive in your quad-core system while running several applications. We now have desktops with 4 cores, and 6 cores are coming quickly. If all of the cores are used by applications and a good percentage of them are performing IO, then you will certainly see some serious stress on your file system, in both metadata operations and throughput. Not everyone runs a database on their desktop, but almost everyone runs a web browser. Take a look at the number and size of the cached files for your browser and I think you will be shocked. The same is true for games. The stress these applications impose on file systems is something not dreamed of several years ago.
So What Happened?
As you are probably aware, the proposed extensions and relaxations have not been approved by the Austin Group; the last update to the HECEWG was four years ago (2006). Why weren’t they accepted? While I don’t know the answer, my supposition is that there was not enough appetite among companies and users to justify the changes. Without companies willing to make changes, and without enough demand from users, the standard simply does not change. And why should it? The mighty cruise ship is traveling steadily and people are having a good time. However, history tells us that the HPC world, while smaller than the enterprise world, usually sees problems several years before everyone else. Has the HECEWG shown that people need more than the POSIX cruise ship can provide, to the point where they will be forced to jump ship? Given the advances in systems, I think it has.
While the cruise ship analogy is useful in making a point, the computing world is not coming to an end because the extensions and relaxations were not passed. We are still growing systems beyond 1 PetaFLOPS, storage capacity and throughput are almost doubling every year, and regulatory requirements are forcing us to store data for extended periods of time.
The HPC world is working very hard to keep up with demand. Shared storage systems are growing in capacity at an ever-increasing rate, and throughput requirements are rising as fast as ever. Couple that with the large number of cores all trying to write to the same shared storage, and you can easily see the bottleneck developing.
Ideally, POSIX should evolve to accommodate the situations in the HPC world, anticipating that the same problems are coming to the enterprise world. However, this hasn’t happened in four years and shows no signs of happening in the near future.
It may come to the point where new storage interfaces are developed that don’t even use POSIX because the problems have become so severe. There are signs of this happening already. Basically POSIX hasn’t evolved so people are starting to just go around it. This means that the standard is no longer useful and has become an impediment to progress.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).