Lies, Damn Lies and File System Benchmarks

Benchmarking has become synonymous with marketeering to the point it is almost useless. This article takes a look at a very important paper that can demonstrate how bad it has become and makes recommendations on how to improve the situation.

“There are lies, damn lies, and then benchmarks.” It’s an overused phrase but it does make a point: Benchmarks have become so abused that no longer are they used to provide useful information for making decisions or for improving solutions. While this is something we seem to inherently know, it was recently backed up with solid data. The cliché, it turns out, is on the mark.

Recently a paper was published in Byte and Switch that examined nine years of storage and file system benchmarking. It’s an important paper. In this article we will summarise the paper’s findings. Looking ahead we plan to use them as the basis of future benchmarking results published in this column.

Yes, it’s that good.

Benchmarking is Just Another Word for Marketeering

Benchmarking is an activity intended to provide information about how fast a piece of software and/or hardware runs. Since this article is about file systems and storage, benchmarking means how fast the file system and/or storage solution performs IO operations.

Benchmarks can be very useful in helping present the benefits or problems of a system. However, rather than just present a simple table or graph of results and declare that one system is better than another, what is truly needed is a discussion of the tests and the systems, the reason behind the benchmarks, and a clear explanation of the results and implications. In particular, the person reading the benchmarks should be able to verify the benchmark results themselves and then be able to compare the performance of one system to another.

To achieve these objectives the benchmarks must be well thought out. That means choosing suitable benchmarks, suitable hardware and configurations, and then providing results that are as accurate as possible.

However, over time benchmarks have been reduced to single graphs or tables with little or no explanation of them. In some cases the benchmarks are published naively and perhaps without adequate explanation (the author is guilty of this), sometimes in the interest of expediency. But in other cases benchmarks are published with little or no information about what was done, solely to promote or detract from a particular product. In other words, benchmarks have become marketing material instead of usable information.

Nine-Year Review of Benchmarks

Avishay Traeger and Erez Zadok from Stony Brook University, together with Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center, recently published a paper entitled “A Nine Year Study of File System and Storage Benchmarking” (Note: a summary of the paper can be found at this link). The paper examines 415 file system and storage benchmarks from 106 recent papers. Based on this examination, the paper makes some very interesting observations and conclusions that are, in many ways, very critical of the way “research” papers have been written about storage and file systems. These results are important to good benchmarking. Stepping back from that, the authors also make recommendations on how to perform good benchmarks (or at the very minimum, “better” benchmarks).

The research included papers from the Symposium on Operating Systems Principles (SOSP), the Symposium on Operating Systems Design and Implementation (OSDI), the USENIX Conference on File and Storage Technologies (FAST), and the USENIX Annual Technical Conference (USENIX). The conferences range from 1999 through 2007. The criteria for the selection of papers were fairly involved but focused on papers of good quality that covered benchmarks focusing on performance, not on correctness or capacity. Of the 106 papers surveyed, the researchers included 8 of their own.

When selecting the papers, they used two underlying themes or guidelines for evaluation:

  • Looking to see if the authors explained exactly what was done – providing details on the benchmarking process.

  • Finding out if the authors didn’t just explain what was done, but also justified why it was done in that particular fashion. For example, explaining why comparing file systems is fair or why a particular benchmark was run.

Breaking Down Good Benchmarks

Repetition: One of the simplest things that can be done for a benchmark is to run the benchmark a number of times and report the median or average. In addition, it would be extremely easy (and helpful) to report some measure of the spread of the data such as a standard deviation. This allows the reader to get an idea of what kind of variation they could see if they tried to reproduce the results and it also allows readers to understand the overall performance over a period of time.
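As a rough illustration of how little effort this takes, here is a minimal Python sketch that times a workload several times and reports the mean, median, and standard deviation. Everything in it is an assumption made for illustration: `write_workload`, `run_repeated`, and `summarize` are hypothetical names, and the 1 MB write-and-fsync workload is only a stand-in for whatever you actually want to measure.

```python
import os
import statistics
import tempfile
import time


def write_workload(size_bytes=1 << 20):
    """Hypothetical workload: write a temporary file and force it to disk."""
    data = b"\0" * size_bytes
    with tempfile.NamedTemporaryFile() as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches storage


def run_repeated(workload, runs=10):
    """Time `workload` `runs` times and return the elapsed seconds of each run."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return the run count, mean, median, and sample standard deviation."""
    return {
        "runs": len(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        # A sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


if __name__ == "__main__":
    print(summarize(run_repeated(write_workload, runs=10)))
```

Reporting the number of runs alongside the statistics matters as much as the statistics themselves, since a mean of one run tells the reader nothing about variation.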

The paper examined the 106 benchmark papers for the number of times the benchmark was run. The table below is from the review paper for all 388 benchmarks examined and is broken down by conference. Since most of the time the data was unclear, it was assumed that each benchmark was run only once.

Table 1 – Statistics of Number of Runs by Conference

Conference   Mean   Standard Deviation   Median
SOSP         2.1    2.4                  1
FAST         3.6    3.6                  1
OSDI         3.8    4.3                  2
USENIX       4.7    6.2                  3

It is fairly obvious that the dispersion in the data is quite large. For every conference in the table, the standard deviation is as large as or larger than the mean value.
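One quick way to put a number on that dispersion is the coefficient of variation, the standard deviation divided by the mean. The short Python sketch below applies it to the values copied from Table 1. The function name `coefficient_of_variation` is just a label chosen here, and this calculation is not something taken from the paper itself.

```python
# Values copied from Table 1: conference -> (mean number of runs, standard deviation)
table1 = {
    "SOSP": (2.1, 2.4),
    "FAST": (3.6, 3.6),
    "OSDI": (3.8, 4.3),
    "USENIX": (4.7, 6.2),
}


def coefficient_of_variation(mean, stdev):
    """Standard deviation relative to the mean; 1.0 or more is very spread out."""
    return stdev / mean


for conference, (mean, stdev) in table1.items():
    cv = coefficient_of_variation(mean, stdev)
    print(f"{conference:7s} CV = {cv:.2f}")
```

For these values the ratio comes out between 1.00 (FAST) and 1.32 (USENIX), which is another way of saying that the number of runs varied enormously from paper to paper.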

Runtime: The next topic examined is the runtime of the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified the elapsed time of the benchmark. From this data, it was found:

  • 28.6% of the benchmarks ran for less than one minute

  • 58.3% ran for less than 5 minutes

  • 70.9% ran for less than 10 minutes

Typically run times that are short (less than one minute) are too short to achieve any sort of steady-state value.

With 49% of the benchmarks reporting no runtime at all, and 28.6% of those that did report one (roughly 15% of the total) running for less than a minute, nearly two-thirds of these results should cause some of your warning bells to start ringing. If there’s no data, it’s not a benchmark; it’s an advertisement.
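One way to guard against runs that are too short is to run for a fixed wall-clock duration, record throughput per interval, and discard an initial warm-up period before averaging. The Python sketch below illustrates the idea only and is not a method from the paper: `run_for`, `steady_state`, the one-second interval, and the warm-up count are all hypothetical choices you would need to tune for a real system.

```python
import time


def run_for(workload, duration_s, interval_s=1.0):
    """Call `workload` repeatedly for about `duration_s` seconds.

    Returns a list with the operations per second measured in each interval.
    """
    rates = []
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        ops = 0
        start = time.perf_counter()
        while time.perf_counter() - start < interval_s:
            workload()
            ops += 1
        rates.append(ops / (time.perf_counter() - start))
    return rates


def steady_state(rates, warmup_intervals):
    """Drop the warm-up intervals and return the mean rate of what is left."""
    kept = rates[warmup_intervals:]
    if not kept:
        raise ValueError("run is too short: nothing left after warm-up")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Placeholder workload; substitute a real IO operation here.
    rates = run_for(lambda: None, duration_s=5, interval_s=1.0)
    print(steady_state(rates, warmup_intervals=2))
```

Printing the per-interval rates, not just the final average, lets the reader see whether the system ever settled down at all. A run that ends before the warm-up is over has nothing useful to report.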

Variety of Benchmarks: The third topic examined was the number of benchmarks run in the papers. It was found that 37.7% of the papers used only one or two benchmarks. This makes it very difficult to understand the true performance of the system because a single benchmark presents only one aspect of the system.

After performing the qualitative examination of the papers and benchmarks, the authors proceeded to examine many of the common benchmarks. They divided the group into several pieces:

  • Macro-Benchmarks with the following examples:

    • Postmark
    • Compile benchmarks (e.g. compiling the kernel)
    • The Andrew File System Benchmark
    • TPC (Transaction Processing Performance Council)
    • SPEC (SFS, SDM, Viewperf, Web99)
    • SPC (Storage Performance Council)
    • NetNews
    • Netbench and dbench (not used very often)

  • Replaying Traces
  • Micro-Benchmarks with the following examples:

    • Bonnie and Bonnie++
    • Sprite LFS
    • Ad-Hoc Micro-Benchmarks

  • System Utilities (e.g. “wc”, “cp”, “diff”, “tar”)
  • Configurable Workload Generators

    • Iometer
    • Buttress
    • FileBench
    • Fstress

Popular Benchmarks != Correct Benchmarks

The researchers then decided to take the two most popular benchmarks, Postmark and Compile, and do some more quantitative analysis to examine how they functioned and what kind of information they could provide. To do this they took the ext2 file system and modified it to slow down certain operations (they called it SlowFS). They slowed down reads (reading data from disk), prepare write and commit write, and lookup. The amount of slowdown varied depending on the mount options. Please note that this type of slowdown exercises the CPU, not the actual IO.

For the compile benchmark they focused on compiling OpenSSH. The compile function is predominantly driven by read functions so they slowed down the read operations by a factor of 32. They found that even at these extreme factors, the execution time for the compile only increased by 4.5%. I think this shows that compiling something is perhaps not the best benchmark, since variations in storage or file system will barely be noticed in the elapsed time. This benchmark is dominated by the CPU time needed to do the actual compilation and not necessarily by IO.

For the Postmark benchmark, they slowed down the previously mentioned operations by a factor of 4 (separately and together) for 3 different “configurations” or sets of Postmark operations. The researchers found that the three different Postmark runs gave widely varying run times, from 2 seconds to 214 seconds (the 2 second run barely produced any IO). The other observation they made was that some sets of Postmark parameters showed more of the SlowFS effects than others.
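A back-of-the-envelope calculation shows just how small the read share must be. Under a simple Amdahl-style model (an assumption made here, not the paper's own analysis), if a fraction f of the baseline run time is spent in the slowed operation and that operation becomes s times slower, the total run time grows by (s - 1) × f. The Python sketch below inverts that relationship; `slowed_fraction` and `relative_increase` are hypothetical helper names.

```python
def relative_increase(fraction, slowdown_factor):
    """Relative growth in total run time when `fraction` of the baseline
    time is spent in an operation that becomes `slowdown_factor` times slower."""
    return (slowdown_factor - 1) * fraction


def slowed_fraction(slowdown_factor, increase):
    """Invert the model: which baseline fraction explains the observed increase?"""
    return increase / (slowdown_factor - 1)


# Figures quoted above: reads slowed 32x, elapsed time up by 4.5%.
f = slowed_fraction(32, 0.045)
print(f"Implied share of baseline time in slowed reads: {f:.2%}")
```

Under this model the slowed read path accounts for well under 1% of the original compile time, which is consistent with the conclusion that the benchmark mostly measures the CPU and tells you very little about the file system.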

Eating Their Own Dogfood

Finally, the authors made some observations and conclusions from their work.

  • They recommend that, with the current set of available benchmarks, at least one macro-benchmark or actual application trace be run along with several micro-benchmarks. In essence, use both macro-benchmarks and micro-benchmarks, and several of them, to better gauge performance, including the areas where the system performs well and where it does not.
  • Benchmark reports should do a much better job of describing what was done as well as why it was done (this second point was emphasized in the paper).

    • Furthermore the authors offer the opinion, with good reason, that there should be some analysis of the system’s expected behavior as well as various benchmarks that either prove or disprove the hypothesis (this goes to the “why” of the benchmark). This goes well beyond the simple graph or table that is so typically shown.

  • The current state of performance evaluations has a great deal of room for improvement.

    • They state that the standards clearly need to be raised
    • They also state that there needs to be better dissemination
    • There need to be better and standardized benchmarks for file systems and storage testing

  • Finally, the authors question the usefulness of standardized industrial benchmarks since they are usually used to report a single number, not to help characterize or benchmark a complex system (i.e. think of the usefulness of the TPC and SPC numbers you see – do they present any useful information to you?)

Summary and Observations

The authors of the paper took a wide range of research oriented benchmarks from reputable conferences and performed a qualitative analysis of them. The results are both extremely interesting and somewhat depressing. From a higher perspective they found:

  • Much of the time, the benchmarks are run only once and in some cases the testing time is so short that the results may be of little use.
  • There is little or no explanation as to why a benchmark was run
  • There is little or no information about the run so that it could be repeated by someone else

  • Some of the benchmarks may not be useful in helping to characterize or benchmark a storage system

In short: You’re doing it wrong.

The paper discussed some recommendations about ways to improve benchmarking which everyone should take to heart. In particular, benchmarks should be run multiple times and be presented with some sort of dispersion measure (e.g. standard deviation). Perhaps more importantly, when the benchmark results are published there should be a discussion of what is hoped to be shown with the benchmarks as well as why certain benchmarks were chosen. It is hoped that all future benchmarks, whether you, the reader, run them or whether you read benchmarks run by someone else, will have this information in the results.
