The Curious Case of the Failing Connections, Part 2

This week we continue on the trek to locate the source of failing MySQL connections and uncover a solution.


As you may recall from last week, we had started to troubleshoot a strange problem with failing or dropped connections between our Sphinx servers and MySQL slaves. We had just begun to suspect the network, and we pick up the story there.

The Suspects

At this point I should note that we weren’t really running any production services at this new data center. This was all part of the process of getting it ready to do so. It wasn’t that unreasonable to think that something might be misconfigured. And, again, even though it tested well, we were also essentially on a new hardware platform and a new operating system release.

We tried moving servers around on the network: move one MySQL box and one Sphinx box from this set of ports to that set of ports to see if it’s a bad linecard in our switch. No dice. The problem was still there. We measured network performance between the hosts and things seemed okay there too.

It turned out that the TCP send queue on the masters (the sending side) would fill, apparently causing other network operations to stall until it could drain a bit. It’s as if the slaves were asking for so much data that it effectively saturated the master’s network and brought down other network connections with it.
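On Linux you can watch that queue fill directly: the Send-Q column of `ss` (or the older `netstat`) shows bytes the kernel has accepted for transmission but the peer has not yet acknowledged. A minimal sketch; the port filter assumes MySQL’s default of 3306:

```shell
# Show established TCP connections with their queue depths.
# Recv-Q / Send-Q are in bytes; a Send-Q that stays large while
# other connections stall is the symptom described above.
ss -tn state established

# Narrow to MySQL traffic (3306 is MySQL's default port).
ss -tn state established '( sport = :3306 or dport = :3306 )'
```

Sampling this in a loop during a slave clone would have shown the send queue pinned high on the sending side.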

But this made little sense, since the same method worked well in production in our live data center. We used tcpdump to capture network traffic in an attempt to look for clues. The data looked normal, though the throughput seemed very uneven.

Maybe network flow control issues? Nope. Ruled that out too.

The Culprit

Eventually we realized that there were actually two factors involved. The errors were only occurring when the Sphinx masters tried to connect to MySQL while one or more of the Sphinx slaves was cloning the last set of indexes via rsync. Ah ha! Intensive disk activity and network activity together were required to trigger the problem!

After quite a bit more debugging and testing, we noticed error counters increasing at the OS level. And one of our systems administrators found that he could trivially reproduce the problem without firing up all the Sphinx apparatus. A simple over-the-network file copy between two of the hosts was enough to see the ping times jump, along with all the other symptoms we were so familiar with.

At this point we spent some time focusing on the disks. They were slower than usual: 5400 RPM vs. the 7200 RPM in most of our servers. When we set up the new data center, I had specifically requested the higher-capacity disks because I needed the space more than the speed; it was a compromise we had to make given the limited disk options available for our blades. Most of the time the complete data set fits in RAM anyway: our indexes are generally in the 20-24GB range and the servers have 32GB installed. But maybe the disks were just slow enough to cause problems?

We did some calculations and followed that line of reasoning for a while before checking the memory stats again and realizing that there was really nothing else substantial running on the machines. In fact, there was no good reason that the data shouldn’t fit completely in memory and then be flushed to disk as the OS gets around to it.

The Real Culprit

At this point we were really puzzled and began to cast a wider net. We started trolling through the various log files in /var/log hoping to see something out of the ordinary jump out. It was in that process that someone made an alarming discovery: it appeared that the disk controller and the network interface were sharing an interrupt.

Oh, no! That makes complete sense. We were only seeing problems when both the network and disk were heavily used. It was conceivable that if the system had to figure out which device had triggered each interrupt every time the network or disk raised one, it could fall behind the curve under load. And that’s just what seemed to be happening.
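On Linux, a suspicion like this is easy to confirm from /proc/interrupts: each row is one IRQ line, and the rightmost columns name every driver attached to it, so two device names on the same row mean a shared interrupt. A sketch (which driver names to look for depends on your hardware):

```shell
# Dump the interrupt table. Under the legacy PIC-style routing that
# noapic forces, it's common to see the disk controller and the NIC
# listed together on a single IRQ row.
cat /proc/interrupts
```

On the affected blades, one row showed both the storage and network drivers; after the eventual fix, each appeared on its own line.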

But why? What would cause the disk and network to end up on the same IRQ in the first place?

We remotely rebooted one of the Sphinx servers and poked around in the BIOS. Everything looked normal there. But as the kernel booted we found that both devices ended up on a shared interrupt again. Glancing at the boot-time options that Grub was passing to the kernel revealed that we were setting the noapic option.
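You don’t actually need to catch the boot messages scrolling by to check this; a sketch of a quicker route:

```shell
# Options the currently running kernel was booted with;
# "noapic" appearing here confirms what Grub passed along.
cat /proc/cmdline
```

Comparing this output before and after editing the Grub config is also a handy sanity check that the change took effect.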

Well, that makes sense. We were telling the kernel not to use any of the “high interrupts” available in modern architectures, instead reverting to the old IRQ options available in the land of ISA cards and 386 processors.

So we rebooted a slave without the noapic option and, sure enough, found that the network and disk ended up on different interrupts. Our network and disk tests then showed far more level throughput.

A few days later, we changed the boot-time options on all the blades in our new data center to avoid the pitfalls of noapic.
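For the record, on a Grub-legacy system of that era the change is a one-word edit to the kernel line. A sketch with illustrative paths and options (only the removal of noapic is the point):

```
# /boot/grub/menu.lst -- before:
kernel /boot/vmlinuz root=/dev/sda2 noapic splash=silent

# after: noapic removed, so the kernel programs the IO-APIC and the
# disk controller and NIC can land on separate interrupt lines
kernel /boot/vmlinuz root=/dev/sda2 splash=silent
```

A reboot is required for the change to take effect.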


I’m not sure why we were ever passing the noapic option to our kernels at boot time. But it clearly had worked fine on our previous hardware platform running OpenSUSE 10.2. A bit of Google searching turns up a few interesting links, however: Why do so many machines need “noapic”? and What do the noapic/nolapic kernel arguments do?. Both lead to discussions that may leave you believing that noapic is just one of those “safe defaults” that usually doesn’t cause problems, until it does.

What I am sure about is how important it is to have good monitoring in place. Without mildly aggressive monitoring, it’s very likely that the problem would have gone undetected for quite some time. We generate a lot of log files, and someone (probably me) would have needed to notice the error and notice that it was occurring with some regularity. And if that had happened after the new data center was live, debugging could have been far more frustrating.

But most importantly, this whole exercise has reinforced how important it is to have a good team of people who can work together to understand how all the pieces (software/applications, operating system, hardware, and network) work together. Without all the right people involved, a problem like this can go unsolved for weeks or even months.
