This week we continue on the trek to locate the source of failing MySQL connections and uncover a solution.
As you may recall from last week, we were starting to troubleshoot a strange problem with failing or dropped connections between our Sphinx servers and MySQL slaves. We were in the midst of suspecting the network, and we pick up the story there.
At this point I should note that we weren’t really running any production services at this new data center. This was all part of the process of getting it ready to do so. It wasn’t that unreasonable to think that something might be misconfigured. And, again, even though it tested well, we were also essentially on a new hardware platform and a new operating system release.
We tried moving servers around on the network: move one MySQL box and one Sphinx box from this set of ports to that set of ports to see if it’s a bad linecard in our switch. No dice. The problem was still there. We measured network performance between the hosts and things seemed okay there too.
It turned out that the TCP send queue on the masters (the sending side) would fill, apparently causing other network operations to stall until it could drain a bit. It’s as if the slaves were asking for so much data that it effectively saturated the master’s network and brought down other network connections with it.
But this made little sense, since the same method worked well in production in our live data center. We used tcpdump to capture network traffic in an attempt to look for clues. The data looked normal, though the throughput seemed very uneven.
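For readers who want to look for the same full-send-queue symptom on their own hosts, here is a minimal Python sketch. It is not the tooling we used. It assumes a Linux host, where /proc/net/tcp lists IPv4 sockets with a hex tx_queue:rx_queue column, and a little-endian machine (such as x86) for the address byte order. The sample text, the addresses, and the function names are made up for illustration.

```python
import socket

# Made-up sample in the layout of Linux's /proc/net/tcp (IPv4 sockets only).
# The addresses, ports, and queue sizes are illustrative, not real measurements.
SAMPLE_PROC_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0A00000A:0CEA 0B00000A:D431 01 0003E800:00000000 01:00000014 00000000     0        0 1002
"""


def _decode_endpoint(field):
    """Turn 'HEXIP:HEXPORT' into (dotted_ip, port).

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    hex_ip, hex_port = field.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(hex_ip)[::-1])
    return ip, int(hex_port, 16)


def send_queue_sizes(text, min_bytes=1):
    """Return (local, remote, tx_queue_bytes) tuples, largest queue first."""
    results = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        tx_hex, _rx_hex = fields[4].split(":")
        tx_bytes = int(tx_hex, 16)
        if tx_bytes >= min_bytes:
            local = _decode_endpoint(fields[1])
            remote = _decode_endpoint(fields[2])
            results.append((local, remote, tx_bytes))
    return sorted(results, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for local, remote, tx_bytes in send_queue_sizes(SAMPLE_PROC_NET_TCP):
        print(local, "->", remote, "send queue bytes:", tx_bytes)
```

On a real machine you would pass in the contents of /proc/net/tcp instead of the sample string and run it repeatedly while the problem is occurring. A send queue that stays large is the thing to watch for.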
Maybe network flow control issues? Nope. Ruled that out too.
Eventually we realized that there were actually two factors involved. The errors were only occurring when the Sphinx masters tried to connect to MySQL while one or more of the Sphinx slaves was cloning the last set of indexes via rsync. Ah ha! Intensive disk activity and network activity together were required to trigger the problem!
After quite a bit more debugging and testing, we noticed counters being increased at the OS level. And one of our systems administrators found that he could trivially reproduce the problem without firing up all the Sphinx apparatus. Just a simple over-the-network file copy between two of the hosts was enough to see the ping times jump and all the symptoms we were so familiar with.
At this point we spent some time focusing on the disks. They were slower, at 5400 RPM vs. the 7200 RPM in most of our servers. When we set up the new data center, I specifically requested the higher-capacity disks because I needed the space more than the speed. It was a compromise we had to make given the limited disk options available for our blades. In fact, most of the time the complete data set fits in RAM anyway: our indexes are generally in the 20-24GB range and the servers have 32GB installed. But maybe it’s possible that the disks were just slow enough to cause problems?
We did some calculations and followed that line of reasoning for a while before checking the memory stats again and realizing that there was really nothing else substantial running on the machines. In fact, there was no good reason that the data shouldn’t fit completely in memory and then be flushed to disk as the OS gets around to it.
The Real Culprit
At this point we were really puzzled and began to cast a wider net. We started trolling through the various log files in /var/log hoping to see something out of the ordinary jump out. It was in that process that someone made an alarming discovery: it appeared that the disk controller and the network interface were sharing an interrupt.
Oh, no! That makes complete sense. We were only seeing problems when both the network and disk were heavily used. It was conceivable that if the system had to figure out who triggered an interrupt every time there was a network or disk interrupt, it could get behind the curve under load. And that’s just what seemed to be happening.
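If you want to check for the same situation on your own machines, a running Linux system lists its interrupt assignments in /proc/interrupts, and devices that share an IRQ show up comma-separated on the same line. Below is a minimal Python sketch that picks out those lines. It is not what we used at the time. The sample text and the device names in it (eth0, disk_ctrl) are placeholders, and the parsing assumes device names contain no spaces.

```python
# Made-up sample in the layout of Linux's /proc/interrupts.
# The device names (eth0, disk_ctrl) are placeholders, not from the real hosts.
SAMPLE_PROC_INTERRUPTS = """\
           CPU0       CPU1
  0:       1024          0    XT-PIC  timer
  1:         10          0    XT-PIC  i8042
 11:     500000          0    XT-PIC  eth0, disk_ctrl
NMI:          0          0
"""


def find_shared_irqs(text):
    """Map each numeric IRQ that has more than one device to its device list."""
    lines = text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row has one column per CPU
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        # Split off the per-CPU counters; what remains is the description.
        fields = rest.split(None, n_cpus)
        if len(fields) <= n_cpus:
            continue
        parts = [part.strip() for part in fields[n_cpus].split(",")]
        if len(parts) < 2 or not parts[0]:
            continue  # only one device on this IRQ
        # The first chunk still carries the controller type; keep its last word.
        first_device = parts[0].split()[-1]
        shared[label] = [first_device] + parts[1:]
    return shared


if __name__ == "__main__":
    for irq, devices in find_shared_irqs(SAMPLE_PROC_INTERRUPTS).items():
        print("IRQ", irq, "is shared by:", ", ".join(devices))
```

To run it against a live host, read /proc/interrupts and pass its contents to find_shared_irqs in place of the sample string.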
But why? What would cause the disk and network to end up on the same IRQ in the first place?
We remotely rebooted one of the Sphinx servers and poked around in the BIOS. Everything looked normal there. But as the kernel booted we found that both devices ended up on a shared interrupt again. Glancing at the boot-time options that Grub was passing to the kernel revealed that we were setting the noapic option.
Well that makes sense. We were telling the kernel not to use any of the “high interrupts” available in modern architectures, instead reverting to the old IRQ options available in the land of ISA cards and 386 processors.
So we rebooted a slave without the noapic option and, sure enough, found that the network and disk ended up on different interrupts. Our network and disk tests then showed far more level throughput.
A few days later, we changed the boot-time options on all the blades in our new data center to avoid the pitfalls of noapic.
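When you have a whole rack of blades to check, it helps to confirm which ones actually booted with the option. Linux exposes the command line the kernel was booted with in /proc/cmdline, so a check can be as small as the Python sketch below. The function names are mine, for illustration only, and this is not the script we ran.

```python
def has_kernel_flag(cmdline, flag="noapic"):
    """Return True if `flag` appears as a whole word on the kernel command line."""
    # Splitting on whitespace avoids false matches such as "nolapic" or "noapictimer".
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the boot-time kernel command line on a Linux host."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    if has_kernel_flag(read_cmdline()):
        print("This host was booted with noapic")
    else:
        print("noapic is not set on this host")
```

Matching on whole words matters here because nolapic is a different option with a similar name.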
I’m not sure why we were ever passing the noapic option to our kernels at boot time. But it clearly had worked fine on our previous hardware platform running OpenSUSE 10.2. A bit of Google searching reveals a few interesting links, however: Why do so many machines need “noapic”? and What do the noapic/nolapic kernel arguments do?. Both lead to interesting discussions that may leave you believing that noapic is just one of those “safe defaults” that usually doesn’t cause problems–until it does.
What I am sure about is how important good monitoring is. Without some mildly aggressive monitoring in place, it’s very likely that the problem could have gone undetected for quite some time. We generate a lot of log files. Someone (probably me) would have needed to notice the error and notice that it was occurring with some regularity. And if that happened after the new data center was live, debugging could have been far more frustrating.
But most importantly, this whole exercise has reinforced how important it is to have a good team of people who can work together to understand how all the pieces (software/applications, operating system, hardware, and network) fit together. Without all the right people involved, a problem like this can go unsolved for weeks or even months.