Lies, Damn Lies and File System Benchmarks

Benchmarking has become synonymous with marketeering to the point that it is almost useless. This article looks at a very important paper demonstrating how bad it has become and makes recommendations on how to improve the situation.

“There are lies, damn lies, and then benchmarks.” It’s an overused phrase but it does make a point: benchmarks have become so abused that they are no longer used to provide useful information for making decisions or for improving solutions. While this is something we seem to inherently know, it was recently backed up with solid data. The cliché, it turns out, is on the mark.

Recently a paper was published that examined nine years of storage and file system benchmarking (a summary appeared in Byte and Switch). It’s an important paper. In this article we will summarise the paper’s findings; looking ahead, we plan to use them as the basis of future benchmarking results published in this column.

Yes, it’s that good.

Benchmarking is Just Another Word for Marketeering

Benchmarking is an activity intended to provide information about how fast a piece of software and/or hardware runs. Since this article is about file systems and storage, benchmarking here means measuring how fast the file system and/or storage solution performs IO operations.

Benchmarks can be very useful in helping present the benefits or problems of a system. However, rather than just presenting a simple table or graph of results and declaring that one system is better than another, what is truly needed is a discussion of the tests and the systems, the reasoning behind the benchmarks, and a clear explanation of the results and their implications. In particular, the person reading the benchmarks should be able to verify the benchmark results themselves and then be able to compare the performance of one system to another.

To achieve these objectives the benchmarks must be well thought out: suitable benchmarks, hardware, and configurations must be chosen, and the results reported as accurately as possible.

However, over time benchmarks have been reduced to single graphs or tables with little or no explanation. In some cases the benchmarks are published naively and perhaps without adequate explanation (the author is guilty of this), sometimes in the interest of expediency. But in other cases benchmarks are published with little or no information about what was done, solely to promote or detract from a particular product. In other words, benchmarks have become marketing material instead of usable information.

Nine-Year Review of Benchmarks

Recently a paper was published by Avishay Traeger and Erez Zadok from Stony Brook University and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center entitled “A Nine Year Study of File System and Storage Benchmarking” (Note: a summary of the paper can be found at this link). The paper examines 415 file system and storage benchmarks from 106 recent papers. Based on this examination the paper makes some very interesting observations and conclusions that are, in many ways, very critical of the way “research” papers have been written about storage and file systems. Stepping back from that, the authors also make recommendations on how to perform good benchmarks (or, at the very minimum, “better” benchmarks).

The research included papers from the Symposium on Operating Systems Principles (SOSP), the Symposium on Operating Systems Design and Implementation (OSDI), the USENIX Conference on File and Storage Technologies (FAST), and the USENIX Annual Technical Conference (USENIX). The papers span 1999 through 2007. The criteria for the selection of papers were fairly involved but focused on papers of good quality that covered benchmarks measuring performance, not correctness or capacity. Of the 106 papers surveyed, eight were the researchers’ own.

When selecting the papers, they used two underlying themes or guidelines for evaluation:

  • Looking to see if the authors explained exactly what was done, providing details on the benchmarking process.

  • Looking to see if the authors not only explained what was done, but also justified why it was done in that particular fashion; for example, explaining why a comparison between file systems is fair or why a particular benchmark was run.

Breaking Down Good Benchmarks

Repetition One of the simplest things that can be done for a benchmark is to run it a number of times and report the median or average. In addition, it is extremely easy (and helpful) to report some measure of the spread of the data, such as the standard deviation. This gives readers an idea of what kind of variation they could see if they tried to reproduce the results, and it also lets them understand the overall performance over a period of time.
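Such a harness is trivial to write. The sketch below is purely illustrative (the paper does not prescribe any particular tool): it runs a workload several times and reports the mean, median, and standard deviation, exactly the numbers a benchmark report should include.

```python
import statistics
import time

def run_benchmark(workload, runs=10):
    """Run `workload` several times and report a central value and the spread.

    `workload` is any callable standing in for a real benchmark; this
    harness is a sketch, not any particular tool's API.
    """
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),  # sample std. deviation; needs runs >= 2
    }

# Example with a trivial stand-in workload
stats = run_benchmark(lambda: sum(range(100_000)), runs=5)
print(f"mean={stats['mean']:.5f}s  median={stats['median']:.5f}s  "
      f"stdev={stats['stdev']:.5f}s")
```

Reporting all three numbers rather than a single figure is the whole point: the spread tells a reader whether the mean is meaningful.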

The paper examined the 106 benchmark papers for the number of times each benchmark was run. The table below, reproduced from the review paper, covers all 388 benchmarks examined and is broken down by conference. Where the number of runs was unclear, it was assumed that the benchmark was run only once.

Table 1 – Statistics of Number of Runs by Conference

Conference   Mean   Standard Deviation   Median
SOSP          2.1                  2.4        1
FAST          3.6                  3.6        1
OSDI          3.8                  4.3        2
USENIX        4.7                  6.2        3

It is fairly obvious that the dispersion in the data is quite large; in some cases the standard deviation is as large as or larger than the mean.

Runtime The next topic examined was the runtime of the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified the elapsed time of the benchmark. Of those that did:

  • 28.6% of the benchmarks ran for less than one minute

  • 58.3% ran for less than 5 minutes

  • 70.9% ran for less than 10 minutes

Typically run times that are short (less than one minute) are too fast to achieve any sort of steady-state value.
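One hedge against too-short runs is to keep repeating the benchmark until the timings settle. The sketch below uses the coefficient of variation over a sliding window as a crude steady-state test; the window size and threshold are arbitrary illustrations, not values from the paper.

```python
import statistics
import time

def run_until_stable(workload, window=5, cv_threshold=0.05, max_runs=50):
    """Repeat `workload` until the coefficient of variation (stdev/mean)
    of the last `window` timings drops below `cv_threshold`.

    A crude steady-state check; all parameters here are illustrative.
    """
    times = []
    while len(times) < max_runs:
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
        if len(times) >= window:
            recent = times[-window:]
            mean = statistics.mean(recent)
            cv = statistics.stdev(recent) / mean if mean > 0 else 0.0
            if cv < cv_threshold:
                break
    return times

timings = run_until_stable(lambda: sum(range(50_000)))
print(f"{len(timings)} runs before timings stabilized")
```

A report built this way can honestly say both how long the benchmark ran and how stable the numbers were when it stopped.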

With 49% of the benchmarks having no known runtime, and 28.6% of the remainder running for less than a minute, roughly two-thirds of these results should cause some of your warning bells to start ringing. If there’s no data, it’s not a benchmark; it’s an advertisement.
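The arithmetic behind combining these two figures is easy to get wrong: the 28.6% applies only to the 51% of benchmarks that reported a runtime at all, so the two fractions have different bases and cannot simply be added.

```python
# 49% of benchmarks reported no runtime at all; of the 51% that did,
# 28.6% ran for under a minute. Rebase the second fraction before adding.
no_runtime = 0.49                       # fraction with no reported runtime
under_1min = 0.286 * (1 - no_runtime)   # short runs, as a fraction of ALL benchmarks

suspect = no_runtime + under_1min
print(f"{suspect:.1%}")                 # about 64% of all benchmarks
```

So the combined figure is closer to two-thirds than to three-quarters; still damning, but worth computing correctly in an article about benchmarking rigor.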

Variety of Benchmarks The third topic examined was the number of benchmarks run in the papers. It was found that 37.7% of the papers used only one or two benchmarks. This makes it very difficult to understand the true performance of the system because a single benchmark presents only one aspect of the system.

After performing the qualitative examination of the papers and benchmarks, the authors proceeded to examine many of the common benchmarks, which they divided into several categories:

  • Macro-Benchmarks with the following examples:

    • Postmark
    • Compile Benchmarks (e.g. compiling the kernel)
    • The Andrew File System Benchmark
    • TPC (Transaction Processing Performance Council)
    • SPEC (SFS, SDM, Viewperf, Web99)
    • SPC (Storage Performance Council)
    • NetNews
    • Netbench and dbench (not used very often)

  • Replaying Traces
  • Micro-Benchmarks with the following examples:

    • Bonnie and Bonnie++
    • Sprite LFS
    • Ad-Hoc Micro-Benchmarks

  • System Utilities (e.g. “wc”, “cp”, “diff”, “tar”)
  • Configurable Workload Generators

    • Iometer
    • Buttress
    • FileBench
    • Fstress

Popular Benchmarks != Correct Benchmarks

The researchers then decided to take the two most popular benchmarks, Postmark and Compile, and do some more quantitative analysis to examine how they functioned and what kind of information they could provide. To do this they took the ext2 file system and modified it to slow down certain operations (they called it SlowFS). They slowed down reads (reading data from disk), prepare write, commit write, and lookup. The slowdown was variable, depending upon mount options. Please note that this type of slowdown exercises the CPU, not the actual IO.
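The paper’s SlowFS was a modified ext2 in the kernel; a rough userspace analogue of the same idea (slow one operation by a factor and see whether a benchmark notices) can be sketched with a file wrapper. The class below is purely illustrative and is not the paper’s implementation.

```python
import io
import time

class SlowReader:
    """Delay each read by a multiplicative factor.

    A userspace stand-in for the SlowFS idea: slow down a selected
    operation and check whether a benchmark's results actually change.
    Illustrative only; the paper modified ext2 in the kernel, not a
    Python file object.
    """
    def __init__(self, fileobj, factor=4):
        self._f = fileobj
        self._factor = factor

    def read(self, size=-1):
        start = time.perf_counter()
        data = self._f.read(size)
        elapsed = time.perf_counter() - start
        # Pad the read so total time is roughly `factor` x the baseline.
        time.sleep(elapsed * (self._factor - 1))
        return data

# Example: wrap an in-memory "file" and slow its reads by 4x
slow = SlowReader(io.BytesIO(b"some file contents"), factor=4)
assert slow.read() == b"some file contents"
```

If a benchmark’s elapsed time barely moves when reads are slowed fourfold, the benchmark is not really measuring reads; that is exactly the test the researchers applied.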

For the compile benchmark they focused on compiling OpenSSH. The compile function is predominantly driven by read operations, so they slowed down reads by a factor of 32. They found that even at this extreme factor, the execution time for the compile increased by only 4.5%. This suggests that compiling something is perhaps not the best benchmark, since variations in the storage or file system have little effect on the elapsed time. The benchmark is dominated by the CPU time needed for the actual compilation, not by IO.

For the Postmark benchmark, they slowed down the previously mentioned operations by a factor of 4 (separately and together) for three different “configurations,” or sets of Postmark operations. Using these three different Postmark runs they got widely varying run times – from 2 seconds to 214 seconds (the 2-second run barely produced any IO). They also observed that some sets of Postmark parameters showed the SlowFS effects more than others.
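Postmark itself works roughly like this: create a pool of small files, then run a random mix of transactions (reads, appends, creates, deletes) over them. The toy sketch below mimics that structure to show why different parameter sets produce such different IO loads; all parameters here are illustrative, not real Postmark defaults.

```python
import os
import random
import tempfile

def mini_postmark(num_files=50, transactions=200, seed=0):
    """A toy Postmark-style workload: build a pool of small files, then
    run a random mix of read/append/create/delete transactions.

    Illustrative only; real Postmark configurations and file-size
    distributions differ.
    """
    rng = random.Random(seed)
    counts = {"read": 0, "append": 0, "create": 0, "delete": 0}
    with tempfile.TemporaryDirectory() as d:
        files = []
        for i in range(num_files):
            path = os.path.join(d, f"f{i}")
            with open(path, "wb") as f:
                f.write(os.urandom(rng.randint(512, 4096)))
            files.append(path)
        for t in range(transactions):
            op = rng.choice(["read", "append", "create", "delete"])
            if op == "read":
                with open(rng.choice(files), "rb") as f:
                    f.read()
            elif op == "append":
                with open(rng.choice(files), "ab") as f:
                    f.write(os.urandom(256))
            elif op == "create":
                path = os.path.join(d, f"new{t}")
                with open(path, "wb") as f:
                    f.write(os.urandom(1024))
                files.append(path)
            else:  # delete, but keep at least one file in the pool
                if len(files) > 1:
                    os.remove(files.pop(rng.randrange(len(files))))
            counts[op] += 1
    return counts

print(mini_postmark(num_files=10, transactions=50))
```

Change the file count, file sizes, or transaction mix and the amount of real IO changes dramatically; that sensitivity is exactly why the researchers saw run times from 2 to 214 seconds across configurations.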

Eating Their Own Dogfood

Finally, the authors made some observations and conclusions from their work.

  • They recommend that, with the current set of available benchmarks, at least one macro-benchmark or actual application trace be tested, as well as several micro-benchmarks. In essence, use both macro-benchmarks and micro-benchmarks, and several of each, to better gauge the performance, including the areas where the system performs well and where it doesn’t.
  • Benchmark reports should definitely improve their descriptions of what was done as well as why it was done (this second point was emphasized in the paper).

    • Furthermore, the authors offer the opinion, with good reason, that there should be some analysis of the system’s expected behavior, as well as various benchmarks that either prove or disprove that hypothesis (this goes to the “why” of the benchmark). This goes well beyond the simple graph or table that is so typically shown.

  • The current state of performance evaluations has a great deal of room for improvement.

    • They state that the standards clearly need to be raised.
    • They also state that there needs to be better dissemination.
    • There need to be better, standardized benchmarks for file system and storage testing.

  • Finally, the authors question the usefulness of standardized industrial benchmarks, since they are usually used to report a single number, not to help characterize or benchmark a complex system (think of the TPC and SPC numbers you see: do they present any useful information to you?).

Summary and Observations

The authors of the paper took a wide range of research-oriented benchmarks from reputable conferences and performed a qualitative analysis of them. The results are both extremely interesting and somewhat depressing. At a high level they found:

  • Much of the time the benchmarks were run only once, and in some cases the testing time was so short that the results may be of little use.
  • There is little or no explanation as to why a benchmark was run.
  • There is little or no information about the run, so it could not be repeated by someone else.
  • Some of the benchmarks may not be useful in helping to characterize or benchmark a storage system.

In short: You’re doing it wrong.

The paper makes some recommendations about ways to improve benchmarking, which everyone should take to heart. In particular, benchmarks should be run multiple times and presented with some sort of dispersion measure (e.g. the standard deviation). Perhaps more importantly, when benchmark results are published there should be a discussion of what the benchmarks are meant to show and why those particular benchmarks were chosen. It is hoped that all future benchmarks, whether you, the reader, run them or read results produced by someone else, will include this information.

Comments on "Lies, Damn Lies and File System Benchmarks"


Great points. Marketeering, and most other for-profit (and government) areas seem to have a problem with ethical presentation of their products, facts, and positions. From healthcare reform proposal “facts” (won’t have to ever change providers) to food ingredients (100% whole wheat now often has “gluten” as the third ingredient; gluten is the part of wheat that remains in white bread), we are being barraged with false advertising.

You can fool some of the people all the time… but you can’t fool all the people all the time. As a consumer-oriented free society, we need to call out these questionable ethics, i.e. false advertising, more often. And then we need to not support those who do this stuff. But it will take all of us acting.

BTW: Runtime is awful, but not as bad as Layton states (49% plus another 28.7% is really only 49% plus 28.7% of the remaining 51%).


Nice Explanation.

Benchmarks are not for engineers or technically knowledgeable people. They are for marketing and management people to brainwash customers. There exist lots of benchmarks; some, like EEMBC and SPEC, even sell their benchmark software for thousands of dollars, which is worth nothing from a technical point of view. They are simply compiler benchmarks: they show how efficient or optimized the output code generated by a specific compiler is, rather than testing the intended target.

After working with lots of benchmark tools and studying them deeply for quite a long time, I found benchmarking to be just a marketing tool which is no good for engineers.


Glad you enjoyed the articles, and thanks for taking part in spreading the word!
