The Curious Case of the Failing Connections, Part 2

This week we continue the trek to locate the source of the failing MySQL connections and uncover a solution.


As you may recall from last week, we had started troubleshooting a strange problem with failing or dropped connections between our Sphinx servers and MySQL slaves. We were in the midst of suspecting the network, and we pick up the story there.

The Suspects

At this point I should note that we weren’t really running any production services at this new data center. This was all part of the process of getting it ready to do so. It wasn’t that unreasonable to think that something might be misconfigured. And, again, even though it tested well, we were also essentially on a new hardware platform and a new operating system release.

We tried moving servers around on the network: move one MySQL box and one Sphinx box from this set of ports to that set to see whether a bad line card in our switch was to blame. No dice. The problem was still there. We measured network performance between the hosts, and things seemed okay there too.

It turned out that the TCP send queue on the masters (the sending side) would fill, apparently causing other network operations to stall until it could drain a bit. It’s as if the slaves were asking for so much data that it effectively saturated the master’s network and brought down other network connections with it.
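One rough way to watch that symptom is to read the send queues straight out of /proc/net/tcp, where column 5 is tx_queue:rx_queue in hex. A minimal sketch, assuming a Linux host (addresses print in the kernel's hex form):

```shell
#!/bin/sh
# Print the transmit-queue depth (bytes sent but not yet ACKed) for
# every socket the kernel lists in /proc/net/tcp.
while read -r sl local rem st queues rest; do
    tx_hex=${queues%%:*}
    case $tx_hex in tx_queue) continue ;; esac   # skip the header line
    tx_bytes=$((0x$tx_hex))
    if [ "$tx_bytes" -gt 0 ]; then
        echo "local=$local send_queue_bytes=$tx_bytes"
    fi
done < /proc/net/tcp
```

On the sending side we'd expect those numbers to balloon during the stalls; `netstat -tn` shows the same information in its Send-Q column with less effort.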

But this made little sense, since the same method worked well in production in our live data center. We used tcpdump to capture network traffic in an attempt to look for clues. The data looked normal, though the throughput seemed very uneven.

Maybe network flow control issues? Nope. Ruled that out too.

The Culprit

Eventually we realized that there were actually two factors involved. The errors were only occurring when the Sphinx masters tried to connect to MySQL while one or more of the Sphinx slaves was cloning the last set of indexes via rsync. Ah ha! Intensive disk activity and network activity together were required to trigger the problem!

After quite a bit more debugging and testing, we noticed counters increasing at the OS level. And one of our systems administrators found that he could trivially reproduce the problem without firing up all the Sphinx apparatus. A simple over-the-network file copy between two of the hosts was enough to make the ping times jump and bring on all the symptoms we were so familiar with.
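Those OS-level counters can be watched cheaply by sampling the kernel's cumulative TCP statistics around a test run; RetransSegs (retransmitted segments) is typically the one that moves when connections are struggling. A minimal sketch, assuming Linux's /proc/net/snmp:

```shell
#!/bin/sh
# Snapshot the kernel's cumulative TCP counters, wait, snapshot again,
# and show which counters moved. In a real test, the sleep would be
# replaced by the load test (e.g. the over-the-network file copy).
grep '^Tcp:' /proc/net/snmp > /tmp/tcp_before
sleep 1
grep '^Tcp:' /proc/net/snmp > /tmp/tcp_after
diff /tmp/tcp_before /tmp/tcp_after || true   # any diff = counters that moved
rm -f /tmp/tcp_before /tmp/tcp_after
```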

At this point we spent some time focusing on the disks. They were slower: 5400 RPM vs. the 7200 RPM drives in most of our servers. When we set up the new data center, I had specifically requested the higher-capacity disks because I needed the space more than the speed; it was a compromise we had to make given the limited disk options available for our blades. In fact, most of the time the complete data set fits in RAM anyway: our indexes are generally in the 20-24GB range and the servers have 32GB installed. But maybe the disks were just slow enough to cause problems?

We did some calculations and followed that line of reasoning for a while before checking the memory stats again and realizing that there was really nothing else substantial running on the machines. In fact, there was no good reason that the data shouldn't fit completely in memory and then be flushed to disk as the OS got around to it.

The Real Culprit

At this point we were really puzzled and began to cast a wider net. We started trolling through the various log files in /var/log hoping to see something out of the ordinary jump out. It was in that process that someone made an alarming discovery: it appeared that the disk controller and the network interface were sharing an interrupt.
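The sharing is visible directly in /proc/interrupts: when more than one driver claims an IRQ line, the devices are listed comma-separated in the last column. A quick check, assuming a Linux host:

```shell
#!/bin/sh
# IRQ lines claimed by more than one driver list the devices
# comma-separated in the final column of /proc/interrupts.
grep ',' /proc/interrupts || echo "no shared IRQ lines found"
```

On the affected blades, a line along the lines of `11: 1234567 XT-PIC eth0, libata` would have been the smoking gun (the device names here are hypothetical; they vary by driver and platform).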

Oh, no! But it made complete sense. We were only seeing problems when both the network and disk were heavily used. It was conceivable that if the system had to figure out which device triggered an interrupt every time a network or disk interrupt fired, it could fall behind the curve under load. And that's just what seemed to be happening.

But why? What would cause the disk and network to end up on the same IRQ in the first place?

We remotely rebooted one of the Sphinx servers and poked around in the BIOS. Everything looked normal there. But as the kernel booted we found that both devices ended up on a shared interrupt again. Glancing at the boot-time options that Grub was passing to the kernel revealed that we were setting the noapic option.

Well, that made sense. We had been telling the kernel not to use any of the "high interrupts" available on modern architectures, instead reverting to the old IRQ options from the land of ISA cards and 386 processors.

So we rebooted a slave without the noapic option and, sure enough, found that the network and disk ended up on different interrupts. Our network and disk tests then showed far more level throughput.

A few days later, we changed the boot-time options on all the blades in our new data center to avoid the pitfalls of noapic.
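The change itself is a one-liner in Grub's configuration. A hypothetical `/boot/grub/menu.lst` entry for that era of OpenSUSE, with `noapic` dropped from the kernel line (paths, device names, and version strings here are illustrative, not our actual config):

```
title OpenSUSE
    root (hd0,0)
    # before: kernel /boot/vmlinuz root=/dev/sda1 noapic splash=silent
    kernel /boot/vmlinuz root=/dev/sda1 splash=silent
    initrd /boot/initrd
```

After a reboot, `cat /proc/cmdline` confirms what the kernel actually booted with.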


I’m not sure why we were ever passing the noapic option to our kernels at boot time. But it clearly had worked fine on our previous hardware platform running OpenSUSE 10.2. A bit of Google searching reveals a few interesting links, however: Why do so many machines need “noapic”? and What do the noapic/nolapic kernel arguments do?. Both lead to interesting discussions that may leave you believing that noapic is just one of those "safe defaults" that usually doesn't cause problems, until one day it does.

What I am sure about is how important it is to have good monitoring in place. Without mildly aggressive monitoring, it's very likely the problem would have gone undetected for quite some time. We generate a lot of log files; someone (probably me) would have needed to notice the error and notice that it was occurring with some regularity. And if that had happened after the new data center went live, debugging could have been far more frustrating.

But most importantly, this whole exercise has reinforced how important it is to have a good team of people who can work together to understand how all the pieces (software/applications, operating system, hardware, and network) work together. Without all the right people involved, a problem like this can go unsolved for weeks or even months.

Comments on "The Curious Case of the Failing Connections, Part 2"


I thought this article set was great. I'm really curious to know how long the time was from onset to resolution, both in real time spent and in "billable" work hours.


To be honest…
This was too funny :-) I know, not for you, but it is for the reader :-)
One of those unbelievable, WTF ones, hehe…

I know those problems well… had a nice reading time, hehe.

By the way: your network team was right (as always), their switches were fine, hehe… (and yes, I know it's embarrassing, at least to yourself, to admit the problem is server-side when it definitely 100% looks like a network problem :-)


I wonder if that's why my laptop seems to "hang" on bootup, but the hard disk starts flashing immediately when I hit the shift key, and repeatably!

I have to keep pressing shift to make it boot up.


I'm assuming that both the disk and the network controllers are PCI resources. If so, then the assignment of interrupts to the various PCI resources is a BIOS function. It's limited by the capabilities of your specific chipset and what legacy interrupts are being used by other resources.

For example, if you have neither a parallel port nor a floppy disk in your servers and disable those controllers in the BIOS, you should be able to use their legacy interrupts (IRQ7 and IRQ6 I think) for other more general purposes. You may have to make other changes in BIOS setup to make the interrupts available for PCI.

Anyway, if these are multi-core processors, then you ought to be able to run the APIC anyway.


Best article in a while. Both well written and informative!


Thanks for the comments on this story, everyone.

How long did this take? Oh, it collectively took several "people days" of time. Hard to say, since it was a little on and off and we had different people involved at different times.

Yes, the network and disk controllers were both PCI resources, and the BIOS *wanted* to do the Right Thing. Once noapic was removed, that cured it. :-)


Jeremy, I have to say that you have a nice collection of dedicated system administrators. It is funny that "noapic" was the culprit, and it was a good thing the systems were not live; trust me, debugging this in production would have been much more difficult, almost impossible, without compromising your five nines of uptime. Great article!


Yeah, if we'd not noticed this until things were running in production, it would have been a nightmare to figure out.


Great post… it seems most people do not think about monitoring up front: setting up baselines and spending the time (it does take time) to actually understand the individual pieces well enough to form a decent, educated hypothesis when things are not working right.

How many sysadmins are reading this thinking, "my boss will never give me the time to figure it all out," yet their boss will want an instant solution when the problem happens…

Great post, I love to not just hear what the solution to a problem is/was, but the steps and tools used to arrive at that conclusion. Thanks!

