One Billion Dollars! Wait… I Mean One Billion Files!!!
The world is awash in data. This fact is putting more and more pressure on file systems to efficiently scale to handle increasingly large amounts of data. Recently, Ric Wheeler from Red Hat experimented with putting 1 billion files in a single file system to understand what problems the Linux community might face in the future. Let's see what happened...
While the previous tests were for 1 million files, they did point out some glaring differences in the various file systems. But Ric wanted to really push the envelope, so he built a very large storage array of almost 100TB using 2TB SATA drives and drive arrays. He formatted the file system with ext4 and then ran about the same tests as he did for the 1 million files but used 20KB files (1 billion of them). Here’s a quick summary of what he found.
Make the file system (mkfs)
Approximately 4 hours
Fill the file system (1 billion files of about 20KB each)
Approximately 4 days
Run a file system check (fsck) with 1 billion files
XFS still has problems with meta-data intensive workloads.
Faster storage can be helpful. In particular, Ric mentioned that btrfs can use SSDs for metadata and leave the bulk data on the slower storage.
I hope everyone has also read my article series about a patch that allows you to use SSDs to cache block devices (storage devices). There are other options for using SSDs for caching, including flashcache. One patch set that has caught the eye of developers gave btrfs the ability to take the “temperature” of data so that it could be moved to a faster or slower device as needed (the movement aspect was not in the proposed patch set).
Ric underscores that the rates are consistent for zero length files and small files (i.e. making the file really small didn’t help the overall performance rates).
The absolute best thing from this testing is that current file systems can handle 1 billion files. They may not be as fast as you want on SATA drives, but they did function and you could actually fsck the file system (if you need more speed you can always spring for some MLC-based SSDs).
Ric talked about some specific lessons they learned from the testing:
When he fsck-ed the 1 billion file ext4 file system (a total of 70TB capacity), it took about 10GB of memory during the operation. This may sound like a great deal of memory and on today’s laptops and desktops this may be true, but on servers this amount of memory is fairly common.
Running xfs_repair (the XFS file system repair tool) on a large file system took almost 30GB of memory, which is quite a bit even for servers.
Ric also mentioned that on a file system with 1 billion files, running an “ls” command is perhaps not a good idea. The reason is that ls uses both the readdir and stat functions, which means that all of the metadata has to be touched twice, so the “ls” operation takes a great deal of time. He did point out a way to reduce the time, but performing an “ls” is still not a fast operation. Moreover, he pointed out that file enumeration, which is what “ls” does to some degree, proceeds at the rate of file creation, so it could take quite a while to complete the “ls” command.
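To make the readdir-plus-stat cost concrete, here is a minimal Python sketch. `enumerate_with_stat` does roughly what a long listing has to do (one pass for the names, then one stat call per name), while `enumerate_names_only` only reads the directory entries. The function names are mine, and skipping the stat calls is just one common way to cut the work; I'm not claiming it is the approach Ric described.

```python
import os
import tempfile

def enumerate_with_stat(path):
    """Read every name, then stat every name: the metadata is touched twice."""
    count = 0
    total_bytes = 0
    for name in os.listdir(path):                  # readdir pass
        st = os.stat(os.path.join(path, name))     # one stat call per entry
        count += 1
        total_bytes += st.st_size
    return count, total_bytes

def enumerate_names_only(path):
    """Read the directory entries only; no per-entry stat call is made."""
    count = 0
    with os.scandir(path) as entries:
        for _entry in entries:
            count += 1
    return count

if __name__ == "__main__":
    # Tiny stand-in for a big directory: 100 files of 20 bytes each.
    demo_dir = tempfile.mkdtemp()
    for i in range(100):
        with open(os.path.join(demo_dir, f"file{i:03d}"), "wb") as f:
            f.write(b"x" * 20)
    print("with stat:", enumerate_with_stat(demo_dir))
    print("names only:", enumerate_names_only(demo_dir))
```

On a toy directory the difference is invisible, but the second function makes one metadata pass instead of two, which is the part that matters when the entry count has nine zeros.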
There have been proposals for improving the responsiveness of “ls” for large file systems but nothing has been adopted universally. One proposal is to do what is termed “lazy updates” on metadata. In this concept, a summary of the metadata is kept within the file system so that when a user performs an “ls” operation, the summary is quickly read and the results are given to the user. However, the “lazy” part of the name implies that the summary data may not be absolutely accurate. It may not have the file sizes absolutely correct or it may not have the file that was created a microsecond prior to the “ls”. But the point of lazy updates is to allow users to get an idea of the status of their file system. Of course, “ls” could have an option such as “-accurate” that tells the command to use readdir and stat to get the most accurate state of the file system possible.
However, even this “accurate” option may not get you the most accurate information. Because there are so many files, by the time the last files have been accessed the status of the first files may have changed. To get “hyper-accurate” values for large file systems, you need to freeze the file system, perform the operation, and then continue to use the file system as normal. I’m not sure how many people would be willing to do this. The problem is that I’ve seen people use the “ls” command as part of an application script. Not the brightest idea in the world in my opinion, but it allowed them to compute what they wanted.
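The lazy-update idea can be imitated in user space, which is a handy way to see the trade-off. The sketch below is only an analogy, not any of the actual proposals: `LazyDirSummary`, `max_age_s`, and the `accurate` flag are names I made up for illustration. It keeps a cached (file count, total bytes) summary for one directory, answers from the cache while it is fresh, and only rescans when the cache is stale or the caller asks for an accurate answer.

```python
import os
import tempfile
import time

class LazyDirSummary:
    """Cache a (file count, total bytes) summary; rescan only when needed."""

    def __init__(self, path, max_age_s=60.0):
        self.path = path
        self.max_age_s = max_age_s     # how stale an answer we will tolerate
        self._summary = None
        self._taken_at = None

    def _scan(self):
        """The expensive part: touch every entry in the directory."""
        count = 0
        total_bytes = 0
        with os.scandir(self.path) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    count += 1
                    total_bytes += entry.stat(follow_symlinks=False).st_size
        return count, total_bytes

    def get(self, accurate=False):
        """Return the cached summary unless it is stale or accuracy is requested."""
        now = time.monotonic()
        stale = self._taken_at is None or now - self._taken_at > self.max_age_s
        if accurate or stale:
            self._summary = self._scan()
            self._taken_at = now
        return self._summary

if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    summary = LazyDirSummary(demo_dir, max_age_s=3600)
    print("first call (scans):", summary.get())
    with open(os.path.join(demo_dir, "new_file"), "wb") as f:
        f.write(b"x" * 20)
    print("lazy call (cached, misses new_file):", summary.get())
    print("accurate call (rescans):", summary.get(accurate=True))
```

The lazy call returns immediately but misses the file created a moment earlier, and even the accurate call is only as current as the moment its scan finished.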
Finally, Ric underscored two other gotchas with extremely large file systems. The first one is that remote replication or backup to tape is a very long process. This is because, again, enumeration and the read rate of the file system drop in performance while other file system operations happen concurrently. This increases the length of time to perform an already long series of operations.
The second thing Ric highlighted was that operations such as backup, which take a long time, could be prone to failures (Ric actually used the words “will fail”). Consequently, for some of these operations we will need to develop a checkpoint/restart capability and even do only a minimal number of IO retries when hitting a bad sector (currently many file systems will try several times with an increasing amount of time between retries; this increases the amount of time the file system will spend trying to read a bad sector, holding up the whole enumeration/read process).
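To get a feel for how long “very long” is, here is a back-of-the-envelope sketch in Python. It takes the fill numbers reported above (1 billion files of about 20KB each in approximately 4 days) and assumes, per Ric's observation, that enumeration runs at about the file creation rate. The `slowdown` factor for concurrent activity is purely a hypothetical parameter of mine, not something Ric measured, and I'm treating 1KB as 1,000 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def files_per_second(n_files, days):
    """Average file rate implied by handling n_files in the given number of days."""
    return n_files / (days * SECONDS_PER_DAY)

def megabytes_per_second(n_files, file_bytes, days):
    """Average data rate in MB/s (1 MB = 1,000,000 bytes)."""
    return n_files * file_bytes / (days * SECONDS_PER_DAY) / 1e6

def enumeration_days(n_files, rate_files_per_s, slowdown=1.0):
    """Days to walk n_files at a given rate, stretched by a hypothetical slowdown."""
    return n_files * slowdown / rate_files_per_s / SECONDS_PER_DAY

if __name__ == "__main__":
    n = 1_000_000_000
    rate = files_per_second(n, 4)
    print(f"fill rate: {rate:,.0f} files/s")
    print(f"data rate: {megabytes_per_second(n, 20_000, 4):.1f} MB/s")
    print(f"walk at 2x slowdown: {enumeration_days(n, rate, slowdown=2.0):.1f} days")
```

The fill works out to roughly 2,900 files per second and just under 58 MB/s. If a backup walk runs at that rate, it needs about 4 days by itself, and any slowdown from concurrent activity multiplies that directly.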
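Here is a minimal sketch of what checkpoint/restart with a small, fixed retry budget could look like for a backup-style walk. It is an illustration under my own assumptions, not an existing tool: `backup_walk`, `read_with_retries`, the JSON checkpoint format, and the `every` parameter are all hypothetical. It also assumes the tree does not change between runs and that the checkpoint file lives outside the tree being walked.

```python
import json
import os
import tempfile

def read_with_retries(path, max_retries=1):
    """Read a file, giving up after a small fixed number of retries (no back-off)."""
    for _attempt in range(max_retries + 1):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            continue  # try again until the retry budget is used up
    return None       # caller records the file as skipped and moves on

def load_checkpoint(checkpoint_path):
    """Return the last fully handled path, or None if there is no checkpoint yet."""
    try:
        with open(checkpoint_path) as f:
            return json.load(f)["last_done"]
    except FileNotFoundError:
        return None

def save_checkpoint(checkpoint_path, last_done):
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_done": last_done}, f)
    os.replace(tmp, checkpoint_path)

def backup_walk(root, checkpoint_path, process, every=1000, max_retries=1):
    """Walk root in a fixed order, resuming after the checkpointed path if any.

    Assumes the checkpointed file still exists on restart; a real tool
    would have to handle a tree that changed in the meantime.
    """
    last_done = load_checkpoint(checkpoint_path)
    resuming = last_done is not None
    skipped, pending, path = [], 0, None
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # os.walk honors in-place edits, so the order is repeatable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if resuming:
                if path == last_done:
                    resuming = False  # everything up to here was already handled
                continue
            data = read_with_retries(path, max_retries)
            if data is None:
                skipped.append(path)
            else:
                process(path, data)
            pending += 1
            if pending >= every:
                save_checkpoint(checkpoint_path, path)
                pending = 0
    if pending:
        save_checkpoint(checkpoint_path, path)
    return skipped

if __name__ == "__main__":
    root = tempfile.mkdtemp()
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    for i in range(5):
        with open(os.path.join(root, f"f{i}.txt"), "wb") as f:
            f.write(b"data")
    backup_walk(root, ckpt, lambda p, d: print("backed up", p), every=2)
    print("checkpoint:", load_checkpoint(ckpt))
```

With `every` greater than 1, files handled after the last checkpoint get processed again on restart, so `process` should tolerate repeats. That is the usual trade-off: checkpoint more often and pay more overhead, or checkpoint less often and redo more work after a failure.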
While it sounds fairly simple, and conceptually it is, Ric’s file system experiments really highlight some limitations in our current file systems. With the increasing pressure of massive amounts of data, having file systems that can scale to billions of files is going to become a requirement (not a “nice-to-have”).
Ric’s simple tests on file systems with 1 million files can easily be done by anyone using small enough files. But these experiments really draw attention to the differences in the file systems as well as the sheer amount of time it can take to perform file system tasks. The really good news is that it is definitely possible to function with file systems with 1 million files if we are a little more patient about the length of time needed to complete certain operations.
Ric’s 1 billion file experiment was the really cool final experiment and we learned a great deal from it. Firstly, we learned that we can actually build and run 1 billion files in our current file systems. They may be slower but they can function. However, the time it takes to perform certain functions has underscored the differences in the file systems. But just as with any experiment that pushes the edges of the proverbial envelope, we learned some new things:
Faster storage hardware really helps (perhaps we need to find a way to get SSDs to be more effectively used in Linux and/or our current file systems).
Doing an “ls” command on a 1 billion file file system is not the best idea. It looks as though we need to rethink how to make this function and others more efficient on really large file systems.
Performing backups or remote replication is going to be a very long process. This means that we have to be ready for failures during these processes, which points to the need for some sort of checkpoint/restart capability.
I want to thank Ric for his great presentation and permission to use the images.