Lies, Damn Lies and File System Benchmarks

Benchmarking has become synonymous with marketeering to the point that it is almost useless. This article looks at an important paper that demonstrates how bad the situation has become and makes recommendations for improving it.

“There are lies, damn lies, and then benchmarks.” It’s an overused phrase but it does make a point: benchmarks have become so abused that they no longer provide useful information for making decisions or for improving solutions. While this is something we seem to inherently know, it was recently backed up with solid data. The cliché, it turns out, is on the mark.

Recently a paper was published in Byte and Switch that examined nine years of storage and file system benchmarking. It’s an important paper. In this article we will summarize the paper’s findings; looking ahead, we plan to use them as the basis of future benchmarking results published in this column.

Yes, it’s that good.

Benchmarking is Just Another Word for Marketeering

Benchmarking is an activity intended to provide information about how fast a piece of software and/or hardware runs. Since this article is about file systems and storage, benchmarking means how fast the file system and/or storage solution performs IO operations.

Benchmarks can be very useful in presenting the benefits or problems of a system. However, rather than just presenting a simple table or graph of results and declaring that one system is better than another, what is truly needed is a discussion of the tests and the systems, the reasoning behind the benchmarks, and a clear explanation of the results and their implications. In particular, readers should be able to verify the benchmark results themselves and then compare the performance of one system against another.

To achieve these objectives the benchmarks must be well thought out: suitable benchmarks, hardware, and configurations must be chosen, and the results reported as accurately as possible.

However, over time benchmarks have been reduced to single graphs or tables with little or no explanation. In some cases the benchmarks are published naively, perhaps without adequate explanation (the author is guilty of this), sometimes in the interest of expediency. But in other cases benchmarks are published with little or no information about what was done, solely to promote or detract from a particular product. In other words, benchmarks have become marketing material instead of usable information.

Nine-Year Review of Benchmarks

Recently there was a paper published by Avishay Traeger and Erez Zadok from Stony Brook University and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center entitled “A Nine Year Study of File System and Storage Benchmarking” (Note: a summary of the paper can be found at this link). The paper examines 415 file system and storage benchmarks from 106 recent papers. Based on this examination, the authors make some very interesting observations and conclusions that are, in many ways, very critical of the way “research” papers about storage and file systems have been written. Stepping back from the criticism, they also make recommendations on how to perform good benchmarks (or at the very minimum, “better” benchmarks).

The research included papers from the Symposium on Operating Systems Principles (SOSP), the Symposium on Operating Systems Design and Implementation (OSDI), the USENIX Conference on File and Storage Technologies (FAST), and the USENIX Annual Technical Conference (USENIX). The papers span 1999 through 2007. The criteria for selecting papers were fairly involved but focused on papers of good quality whose benchmarks measured performance rather than correctness or capacity. Of the 106 papers surveyed, 8 were the researchers’ own.

When selecting the papers, they used two underlying themes or guidelines for evaluation:

  • Whether the authors explained exactly what was done, providing details on the benchmarking process.

  • Whether the authors not only explained what was done, but also justified why it was done in that particular fashion; for example, explaining why comparing the file systems is fair or why a particular benchmark was run.

Breaking Down Good Benchmarks

Repetition One of the simplest things that can be done for a benchmark is to run the benchmark a number of times and report the median or average. In addition, it would be extremely easy (and helpful) to report some measure of the spread of the data such as a standard deviation. This allows the reader to get an idea of what kind of variation they could see if they tried to reproduce the results and it also allows readers to understand the overall performance over a period of time.
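Reporting repeated runs along with a dispersion measure takes only a few lines of code. Here is a minimal Python sketch; the `run_benchmark` helper and the toy workload are hypothetical illustrations, not anything from the paper:

```python
import statistics
import time

def run_benchmark(workload, runs=10):
    """Run a benchmark callable several times and report summary statistics."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),  # sample standard deviation
    }

# Example: a trivial stand-in workload; a real run would exercise the file system
stats = run_benchmark(lambda: sum(range(100_000)), runs=5)
print(f"mean={stats['mean']:.4f}s median={stats['median']:.4f}s "
      f"stdev={stats['stdev']:.4f}s")
```

Publishing the median alongside the standard deviation, as the sketch does, is exactly the kind of minimal disclosure the reviewers found missing.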

The paper examined the 106 papers for the number of times each benchmark was run. The table below, taken from the review paper, covers all 388 benchmarks examined, broken down by conference. When the number of runs was unclear, which was most of the time, it was assumed that the benchmark was run only once.

Table 1 – Statistics of Number of Runs by Conference

Conference   Mean   Standard Deviation   Median
SOSP          2.1          2.4             1
FAST          3.6          3.6             1
OSDI          3.8          4.3             2
USENIX        4.7          6.2             3

It is fairly obvious that the dispersion in the data is quite large. In some cases the standard deviation is as large as, or larger than, the mean value.
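To make the dispersion concrete, the coefficient of variation (standard deviation divided by mean) can be computed directly from Table 1; a value near or above 1 means the spread is as large as the typical value itself. A quick Python check:

```python
# Mean and standard deviation of run counts from Table 1
runs = {
    "SOSP":   (2.1, 2.4),
    "FAST":   (3.6, 3.6),
    "OSDI":   (3.8, 4.3),
    "USENIX": (4.7, 6.2),
}

for conf, (mean, stdev) in runs.items():
    cv = stdev / mean  # coefficient of variation; >= 1 indicates huge spread
    print(f"{conf:7s} CV = {cv:.2f}")
```

Every conference comes out at or above 1.0, confirming just how scattered the number of runs really is.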

Runtime The next topic examined is the runtime of the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified the elapsed time of the benchmark. From this data, it was found:

  • 28.6% of the benchmarks ran for less than one minute

  • 58.3% ran for less than 5 minutes

  • 70.9% ran for less than 10 minutes

Typically run times that are short (less than one minute) are too fast to achieve any sort of steady-state value.

With 49% of the benchmarks having no known runtime, and 28.6% of the rest running for less than a minute, well over half of these results should set some of your warning bells ringing. If there’s no data, it’s not a benchmark; it’s an advertisement.
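One simple guard against too-short runs is to repeat the workload until a minimum elapsed time has passed before reporting a rate. The following Python sketch is illustrative only; the 60-second floor and the `timed_benchmark` helper are assumptions of this example, not a standard:

```python
import time

MIN_RUNTIME = 60.0  # seconds; below this, results rarely reach steady state

def timed_benchmark(workload, min_runtime=MIN_RUNTIME):
    """Repeat the workload until a minimum elapsed time is reached,
    then report throughput in iterations per second."""
    iterations = 0
    start = time.perf_counter()
    while (elapsed := time.perf_counter() - start) < min_runtime:
        workload()
        iterations += 1
    return iterations / elapsed

# e.g. rate = timed_benchmark(my_io_workload)
```

Reporting the elapsed time alongside the rate, as this structure forces you to do, would by itself have answered the question for the 49% of benchmarks that left it open.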

Variety of Benchmarks The third topic examined was the number of benchmarks run in the papers. It was found that 37.7% of the papers used only one or two benchmarks. This makes it very difficult to understand the true performance of the system because a single benchmark presents only one aspect of the system.

After performing the qualitative examination of the papers and benchmarks, the authors proceeded to examine many of the common benchmarks, which they divided into several categories:

  • Macro-Benchmarks with the following examples:

    • Postmark
    • Compile benchmarks (e.g. compiling the kernel)
    • The Andrew File System Benchmark
    • TPC (Transaction Processing Performance Council)
    • SPEC (SFS, SDM, Viewperf, Web99)
    • SPC (Storage Performance Council)
    • NetNews
    • Netbench and dbench (not used very often)

  • Replaying Traces
  • Micro-Benchmarks with the following examples:

    • Bonnie and Bonnie++
    • Sprite LFS
    • Ad-Hoc Micro-Benchmarks

  • System Utilities (e.g. “wc”, “cp”, “diff”, “tar”)
  • Configurable Workload Generators

    • Iometer
    • Buttress
    • FileBench
    • Fstress

Popular Benchmarks != Correct Benchmarks

The researchers then decided to take the two most popular benchmarks, Postmark and compile, and do some more quantitative analysis to examine how they functioned and what kind of information they could provide. To do this they took the ext2 file system and modified it to slow down certain operations (they called it SlowFS). They slowed down reads (reading data from disk), prepare write, commit write, and lookup. The slowdown factor was variable, controlled through mount options. Please note that this type of slowdown exercises the CPU, not the actual IO.

For the compile benchmark they focused on compiling OpenSSH. Compilation is predominantly driven by read operations, so they slowed down reads by a factor of 32. They found that even at this extreme factor, the execution time for the compile increased by only 4.5%. I think this shows that compiling something is perhaps not the best benchmark, since variations in the storage or file system will barely be noticed in the elapsed time. The benchmark is dominated by the CPU time for the actual compilation, not by IO.
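The 4.5% figure follows Amdahl’s-law-style reasoning: if a fraction f of total runtime is spent in reads and reads are slowed by a factor s, total runtime scales by (1 - f) + f*s. Working backwards from the paper’s numbers shows just how little of the compile is IO:

```python
# T'/T = (1 - f) + f*s, so an observed increase of 4.5% at s = 32
# implies f = 0.045 / (s - 1)
s = 32
observed_increase = 0.045
f = observed_increase / (s - 1)
print(f"implied read fraction of runtime: {f:.4%}")  # roughly 0.15%
```

Reads account for roughly 0.15% of the compile’s runtime under this model, which is why even a 32x read slowdown barely moves the result.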

For the Postmark benchmark, they slowed down the previously mentioned operations by a factor of 4 (separately and together) for three different “configurations,” or sets of Postmark operations. Across the three configurations the run times varied widely, from 2 seconds to 214 seconds (the 2-second run barely produced any IO). They also observed that some sets of Postmark parameters exposed the SlowFS effects much more than others.
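Why different parameter sets expose a slowdown differently can be sketched with a simple cost model. The operation counts and per-operation costs below are made up for illustration and are not taken from the paper’s Postmark configurations:

```python
# Hypothetical per-operation costs (arbitrary time units)
base_cost = {"read": 1.0, "write": 1.0, "lookup": 0.2}

def total_time(mix, slowdown=1.0):
    """Total runtime of a workload mix with reads slowed by the given factor."""
    return sum(count * base_cost[op] * (slowdown if op == "read" else 1.0)
               for op, count in mix.items())

read_heavy  = {"read": 900, "write": 50,  "lookup": 50}
write_heavy = {"read": 50,  "write": 900, "lookup": 50}

for name, mix in [("read-heavy", read_heavy), ("write-heavy", write_heavy)]:
    ratio = total_time(mix, slowdown=4) / total_time(mix)
    print(f"{name}: 4x read slowdown -> {ratio:.2f}x total runtime")
```

In this toy model the same 4x read slowdown inflates the read-heavy mix by about 3.8x but the write-heavy mix by only about 1.2x, mirroring the paper’s observation that the choice of Postmark parameters determines how visible the SlowFS effects are.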

Eating Their Own Dogfood

Finally, the authors made some observations and conclusions from their work.

  • They recommend that, given the current set of available benchmarks, at least one macro-benchmark or actual application trace be tested along with several micro-benchmarks. In essence, use both macro-benchmarks and micro-benchmarks to better gauge performance, including the areas where the system performs well and where it does not.
  • Benchmark reports should definitely improve their descriptions of what was done as well as why it was done (this second point was emphasized in the paper).

    • Furthermore, the authors offer the opinion, with good reason, that there should be some analysis of the system’s expected behavior, along with benchmarks that either prove or disprove that hypothesis (this goes to the “why” of the benchmark). This goes well beyond the simple graph or table so typically shown.

  • The current state of performance evaluations has a great deal of room for improvement.

    • They state that standards clearly need to be raised
    • They also state that there needs to be better dissemination of benchmarking practices
    • There need to be better, standardized benchmarks for file system and storage testing

  • Finally, the authors question the usefulness of standardized industrial benchmarks, since they are usually used to report a single number rather than to characterize a complex system (think of the TPC and SPC numbers you see: do they present any useful information to you?)

Summary and Observations

The authors of the paper took a wide range of research-oriented benchmarks from reputable conferences and performed a qualitative analysis of them. The results are both extremely interesting and somewhat depressing. From a high level they found:

  • Much of the time the benchmarks were run only once, and in some cases the testing time was so short that the results may be of little use.
  • There is little or no explanation of why a benchmark was run.
  • There is little or no information about the run, so it cannot be repeated by someone else.
  • Some of the benchmarks may not be useful in helping to characterize or benchmark a storage system.

In short: You’re doing it wrong.

The paper makes some recommendations about ways to improve benchmarking that everyone should take to heart. In particular, benchmarks should be run multiple times and presented with some measure of dispersion (e.g. standard deviation). Perhaps more importantly, published benchmark results should include a discussion of what the benchmarks are intended to show and why those particular benchmarks were chosen. It is hoped that all future benchmarks, whether you run them yourself or read results produced by someone else, will include this information.
