Fighting Performance Regression

Arjan Van De Ven shines some light on the issues, projects, and techniques that characterize kernel performance.

It’s the worst nightmare of every system administrator: a well-working server with a perfect track record gets a security upgrade of the kernel and suddenly the performance drops and the machine can’t handle the workload anymore.

The rate of change in the Linux kernel is really high: Greg Kroah-Hartman recently showed that each hour, 4 patches are applied to the Linux kernel, 24 hours a day, 365 days a year. A few million lines of code change every kernel release. Yet despite this high rate of change, the nightmare scenario mentioned above isn’t very likely.

This article shines some light on the issues, projects, and techniques that characterize kernel performance — ensuring system administrators around the world can sleep well at night.

Benchmarks and Performance Testing

Testing performance may sound easy, but in reality, it’s a really tricky and complex topic that requires a lot of skill, great attention to detail, and mounds of patience. Anyone can Google for a Linux benchmark and get two and a half million hits. Anyone can pick a top result, run the program, see the same or a higher number as the day before, declare victory, and ship the patch to Linus with a “this patch is perfect” comment, right? Sadly, things aren’t so easy.

Picking and running a useful benchmark actually requires some thought and investigation. A few things are very important when picking a benchmark; they may sound obvious, but sadly many of the results you'll find on Google fail to meet the mark.

One of the most important things is that the benchmark gives reproducible results with a low variation. If the benchmark gives results that go up or down by 30 percent or more in the exact same situation, you can’t draw any conclusions about your kernel patch. However, if you get a 10 percent improvement on a benchmark whose results vary by less than 1 percent between runs in the same setup, you can feel comfortable in deciding your patch is doing something right.
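One quick way to judge a benchmark's variation is simply to run it several times on an unchanged system and look at how far the scores spread. The following is a minimal sketch, assuming a hypothetical ./mybench program that prints one numeric score per run:

$ for i in $(seq 1 10); do ./mybench >> scores.txt; done
$ awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean %.2f, spread %.1f%%\n", m, 100 * sqrt(ss / n - m * m) / m }' scores.txt

If the relative spread is well below the effect you are trying to measure, the benchmark is usable; if not, you need more runs or a more stable benchmark.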

A second thing to consider is whether or not the benchmark really represents some application or workload in the real world. Specifically, if the benchmark shows improvement, real usages of the system should also show improvement and vice versa. The infamous dbench benchmark fails this test, for example — it’s easy to get a high score on dbench by doing things in the kernel that significantly hurt most applications.

Another trap is picking a microbenchmark that measures too minute a detail of the kernel. While you can draw conclusions about a specific operation in the kernel, you’ll miss the impact on total system behavior. A kernel change that improves that one detail may have a huge cost elsewhere, such that the net effect is negative.

Even after you find a good benchmark, you’re not done yet. A positive result on your laptop doesn’t always translate to the 16-core server at work, where performance is still horrible because of a global lock causing massive contention.

To effectively and reliably measure the performance impact of a kernel change, make sure your benchmarks give reproducible results with a low variation, correspond to real life usages, and are broad enough to capture system-wide effects. Make sure to also run your benchmarks on a set of different machines to capture scalability and other differences.

The Linux Kernel Performance Project

From the description above, it’s clear that not every kernel developer has the time, patience, and equipment to benchmark his or her patches for performance regressions. But don’t worry. The Linux Kernel Performance Project (LKP, http://kernel-perf.sf.net) run by Tim Chen, Yanmin Zhang, and Alex Shi is there to fill this gap.

The LKP team runs a set of known good benchmarks weekly on the development kernel from Linus (more often if Linus releases more kernels). If the team detects a regression on any of the benchmarks, they zoom in to identify the patch that caused the regression. When the culprit is found, Tim and company either directly fix the issue or contact the author of the patch for further discussion.

The benchmarks used by LKP include: OLTP (a database benchmark), the industry-standard Java benchmark, cpu2000, httperf, Netperf, IOZone, Tbench, ReAIM7, Volanomark, Sysbench, Aiostress, a kernel build, and mmbench.

As you can see, the LKP project runs a whole suite of benchmarks, which means each new kernel yields a whole new set of numbers. However, so much data often makes it difficult to see if a kernel is better or worse than a previous one, because there just isn’t a single answer.

To solve this dilemma, the LKP project maintains an index, or weighted average, of the results of the various benchmarks on the set of machines. The index (much like a stock market index) is a single number that you can use to see if there is anything seriously wrong or really good in a new kernel version. More information about the exact composition of the index and its weights can be found on the LKP project’s website.
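The real composition and weights are documented on the project site, but the idea is simple enough to illustrate. The sketch below assumes each benchmark result has already been normalized against a baseline kernel (1.00 meaning "no change") and uses made-up weights and benchmark names, not the actual LKP configuration:

$ cat results.txt
netperf   0.4   1.02
tbench    0.3   0.97
iozone    0.2   1.01
kbuild    0.1   0.99
$ awk '{ idx += $2 * $3 } END { printf "index: %.3f\n", idx }' results.txt
index: 1.000

A single number like this makes trends across kernel versions easy to spot, even though it hides which individual benchmark moved.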

Zooming into a Regression

A key step in the process of regression prevention is pinpointing the offending patch. However, since the tests run weekly, thousands of patches may (and often will) have been applied to the kernel in the meantime. This is where a technique called bisecting comes in. Since this problem is generic and applicable to many people, let’s go into some technical detail on how you can do this yourself using the git version control system used by Linux.

The concept behind bisecting is that of binary search: Find a known, good starting point and couple it with the known, bad ending point. Next, take a point in the middle, test that, and if it passes the tests it becomes the new latest known good starting point. Each such bisection halves the range of suspect patches. Similarly, the list of suspects is halved when the test version fails the test. By repeating this a few times (at most ten times in the case of 1,000 patches), the set of culpable patches will shrink to (hopefully) one.

As a first step, check out Linus’ git tree:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

This is the trunk, and assuming it’s bad (you just tested it after all), the following commands inform git of the problem:

# start bisection
$ git bisect start
# tell git that the current version is bad
$ git bisect bad

The next step is to find the last known, good version. Let’s assume that the 2.6.23 kernel is known to be good:

$ git bisect good v2.6.23

This forces git to choose a middle point. You can now compile the kernel and test it. If the test is successful, enter:

$ git bisect good

Or, if it fails the test, type:

$ git bisect bad

Either command causes git to pick a new “middle point”, leading to another compile and test cycle — until there’s only one commit left. The “git bisect log” command can then be used to report all the steps you performed.
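If the compile-and-test cycle itself can be scripted, git can even drive the whole bisection automatically with “git bisect run”. Here is a sketch assuming a hypothetical run-benchmark.sh that exits with 0 when performance is acceptable and 1 when it is not:

$ cat test.sh
#!/bin/sh
# exit code 125 tells "git bisect run" to skip commits that do not build
make -j4 || exit 125
# the benchmark script exits 0 for good performance, non-zero for bad
./run-benchmark.sh
$ chmod +x test.sh
$ git bisect run ./test.sh

git then repeats the checkout, build, and test steps on its own and stops at the first bad commit.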

Examples of Regression Intervention

During the development of 2.6.15, a scheduler cleanup was merged that accidentally changed behavior, causing the scheduler to not balance workloads properly. The error was most noticeable on Itanium systems, where the Volano benchmark degraded by 60 percent. The LKP project quickly identified the cause and reported the problem; the next release candidate kernel had the issue fixed.

During 2.6.21 development, a change intended as an optimization was made to the CFQ I/O scheduler. However, the new behavior actually slowed performance by 15 percent for multithreaded read/write I/O (while working beautifully for single-threaded I/O). The LKP team worked with Jens Axboe (the Linux block and CFQ maintainer) on this issue, with the end result that 2.6.22 not only reverses the regression but shows a 2 percent improvement.

True Tales of Regressions

Sadly, not all regressions can be prevented, even when identified. A recent example was a change in the Linux TCP/IP stack where the kernel started to send more ACK packets than before. These extra packets cost performance (sending packets takes work and takes away bandwidth from other packets). However, the change was made intentionally to fix a correctness bug where Linux in some cases didn’t send the packets that it was supposed to send, according to the protocol. Of course, everyone prefers a reliable TCP/IP connection over one that is a little bit faster but unreliable.

Another recent change that caused a benchmark regression was a virtual memory tunable change. The VM has a tunable that controls when background disk I/O starts to happen. The tunable is expressed as a percentage of memory that would need to go to disk. This percentage used to be set at 40 percent: if 40 percent of the memory in a system was modified, the kernel would start sending data to the disk. Recently, Linus changed the default value of this percentage to 10 percent.
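The tunable described here appears to correspond to the vm.dirty_ratio sysctl — an assumption based on the 40-to-10 percent change described above; check the vm sysctl documentation for your kernel to be sure. Assuming that name, you can inspect the current value and, for testing, temporarily restore the old threshold like this:

# show the current threshold (percentage of memory that may be dirty)
$ cat /proc/sys/vm/dirty_ratio
10
# as root, temporarily restore the old 40 percent default for comparison
$ sysctl -w vm.dirty_ratio=40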

After the change, some of the benchmarks started to show lower performance. Linus and Andrew Morton investigated the issue when the LKP team brought it up, and concluded that, in this case, the benchmark behavior was artificial and not representative of real-world applications. In addition, the new setting of 10 percent would improve system behavior for real people by having a much smoother I/O pattern, rather than a chunky I/O pattern.

Much like beauty, benchmarks depend on the eye of the beholder.
