Troubleshooting is often an adventure that takes you to unexpected places and teaches you a lot along the way.
One of the frustrating things about building and running backend infrastructure services is that things break sometimes — often in unexpected ways. And when they do, you can spend untold hours chasing down various ideas, most of which are dead ends, before eventually solving the problem. The goal in the end, as always, is to have a more stable and resilient service that will serve your users without getting you out of bed in the middle of the night.
What’s often interesting about these problems is not necessarily the solution — it’s what you learn along the way. If you’re not debugging and testing alone, you’ve likely got a small team of folks helping you out. And in that process you’ll discover things you’d never encountered before, including hardware/software interactions that you’d previously never considered.
It can be incredibly educational, if not frustrating and tiresome.
This is the story of one such problem and our attempts to solve it.
If you’ve read much of my previous writing, you may know that one of the services I work on at Craigslist is search. We have a few clusters of Sphinx servers configured in a master/slave configuration. The masters are responsible for incrementally updating search indexes and making them available for the slaves to pull via rsync. They do this by periodically querying a slave of our master classifieds MySQL database to pull lists of all the changes since the last incremental build. The code first fetches a list of all deleted document ids so it can mark them as deleted in the per-city indexes. Then it asks for a list of all other new and changed documents so they can be indexed. The resulting index is then merged into an active index.
My code takes those documents, transforms them into XML for Sphinx, and runs several parallel index builds and index merges to produce the next index snapshot that the slaves are looking for. Once that’s done, the new indexes are advertised to the slaves so they can rsync the changes during their next check for updated indexes, run via cron.
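The update logic described above can be modeled as a simple in-memory sketch. This is only a toy illustration of the “drop deleted ids, then upsert new and changed documents” step; the real pipeline uses Sphinx’s indexer and index merging, and all the names here are my own, not from the actual code.

```python
# Toy model of an incremental index update: remove deleted documents,
# then apply new and changed ones to produce the next snapshot.
# (Illustrative only; the real system builds and merges Sphinx indexes.)

def apply_increment(active, deleted_ids, changed_docs):
    """Return a new index snapshot: drop deleted ids, upsert changes."""
    snapshot = {doc_id: doc for doc_id, doc in active.items()
                if doc_id not in deleted_ids}
    snapshot.update(changed_docs)   # new and modified documents win
    return snapshot

active = {1: "bike", 2: "sofa", 3: "car"}
new = apply_increment(active,
                      deleted_ids={2},
                      changed_docs={3: "used car", 4: "desk"})
print(new)   # {1: 'bike', 3: 'used car', 4: 'desk'}
```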
That’s all pretty straightforward. We’d been running this for many months for our production search clusters in our main data center without issue. But we recently went through the exercise of bringing up Sphinx clusters in our newest data center. As part of this process, we were making two notable but seemingly innocuous changes. First, we were moving to a new hardware platform that consists primarily of blade servers from IBM. They’d performed well in our evaluation last year and appeared to run our software well.
The second change was moving from our older OpenSUSE 10.2 to OpenSUSE 11.1. Again, we’d done a fair amount of testing in our development environment and found that the change was pretty transparent, aside from needing to put a few more packages on our default installation list.
For a few weeks, things appeared to be running just fine. The blades performed well and we saw no unusual errors. My brief load testing indicated that the 3GHz CPUs could handle many hundreds of queries per second, easily matching our previous infrastructure. But when we enabled alerting on our monitoring system, we realized that something was amiss. We were seeing periodic alerts about “stale” sphinx indexes on both of the newly configured clusters.
Hmm. Upon digging into it a bit, I realized a few things that narrowed the search. These weren’t index build failures. The index building and merge process was working fine. So there wasn’t a problem running binaries I’d built on 10.2 on the 11.1 boxes. And it wasn’t a problem with the slaves copying indexes from the masters. The rsync processes fired as expected and worked just fine. It’s just that the master’s indexes weren’t always as “fresh” as we’d hoped. That let me focus the troubleshooting.
Digging deeper into the logs, I found out what was happening. The Perl code responsible for connecting to the slave database and pulling the deleted, changed, and new documents was occasionally failing in an odd way: it was losing its connection to MySQL and logging an error message that looked like this:
Lost connection to MySQL server at 'reading authorization packet', system error: 0
Well that’s just weird. Why would that happen?
A bit of searching around revealed an entry from the public MySQL bug database: Bug #28359: Intermittent lost connection at ‘reading authorization packet’ errors. Reading through that bug was very enlightening and gave us some new insight into how the MySQL client and server negotiate a new connection and the “handshake” process they go through.
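Since the failure happens during the initial handshake rather than mid-query, one generic mitigation for this class of error is simply retrying the connect with a short backoff. Our code is Perl, but here’s a hedged language-neutral sketch in Python; `connect_fn` is a stand-in for whatever DBI-style connect call you actually use, and the exception type is an assumption.

```python
import time

def connect_with_retry(connect_fn, attempts=3, delay=0.5):
    """Retry a flaky initial connect a few times before giving up.

    connect_fn is a hypothetical zero-argument callable that returns a
    connection or raises on failure (OSError here is a stand-in for a
    real driver's lost-connection error).
    """
    last_err = None
    for i in range(attempts):
        try:
            return connect_fn()
        except OSError as err:
            last_err = err
            time.sleep(delay * (i + 1))   # simple linear backoff
    raise last_err
```

This doesn’t fix the underlying problem, of course; it only papers over intermittent handshake failures while you chase the real cause.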
One thing that bug report suggested was checking the values of two global status counters in MySQL: Aborted_clients and Aborted_connects.
mysql> show status like 'abort%';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| Aborted_clients  | 199461 |
| Aborted_connects | 59490  |
+------------------+--------+
2 rows in set (0.05 sec)
Sure enough, our counters were increasing too.
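The absolute values of those counters aren’t very meaningful on their own; what matters is whether they’re climbing. A simple check is to sample the status output twice and compare. Here’s a sketch that parses `mysql -e "SHOW STATUS LIKE 'Abort%'"`-style output; the canned strings and the sampling approach are mine, not from our actual monitoring.

```python
def parse_status(output):
    """Parse pipe-delimited SHOW STATUS output into {name: int}."""
    counters = {}
    for line in output.splitlines():
        parts = [p.strip() for p in line.strip().strip("|").split("|")]
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

before = parse_status("""
| Variable_name    | Value  |
| Aborted_clients  | 199461 |
| Aborted_connects | 59490  |
""")
# A second sample taken a bit later (values are illustrative):
after = {"Aborted_clients": 199520, "Aborted_connects": 59511}
delta = {k: after[k] - before[k] for k in after}
print(delta)   # {'Aborted_clients': 59, 'Aborted_connects': 21}
```

A steadily growing delta, especially in Aborted_connects, is what tells you clients are failing during connection setup rather than mid-session.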
This led us to suspect the network. (The networking team always loves it when we do that.) It really did seem as though the messages from the MySQL server weren’t getting to the Sphinx master in a timely fashion.
They checked for errors on the switch ports that the Sphinx and MySQL servers were plugged into. Nothing unusual showed up. They contacted our switch vendor for advice. In the meantime, we started looking into more details on the operating system side to see if anything else would suggest a source of the trouble. I tried doing simple ping tests from servers in the Sphinx cluster to servers in the MySQL cluster. And, much to my surprise, we did see a pattern repeat itself. Under normal circumstances, the ping times were in the range of 0.2 to 0.4 milliseconds. In other words, they were completely normal.
But during those intervals when the MySQL connections would fail, ping times would jump into the 40 millisecond range. Seeing times increase by a factor of 100 suggested that we were definitely looking at some kind of networking problem. We pressed our switch vendor for more ideas and assistance.
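Spotting those spikes by eyeballing ping output gets old fast, so it’s worth scripting. Here’s a small sketch that pulls the RTTs out of standard `ping` output and flags the outliers; the 5 ms threshold is an assumption I picked based on the ~0.2 to 0.4 ms baseline versus the ~40 ms spikes.

```python
import re

def spike_times(ping_output, threshold_ms=5.0):
    """Return the round-trip times (ms) that exceed threshold_ms."""
    times = [float(m) for m in re.findall(r"time=([\d.]+) ms", ping_output)]
    return [t for t in times if t > threshold_ms]

sample = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.31 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=41.2 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.28 ms
"""
print(spike_times(sample))   # [41.2]
```

In practice you’d feed this something like `ping -c 100 dbhost` and correlate the spike timestamps against the MySQL connection failures.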
To Be Continued…
Check back in next week to see what we tried next and how we ultimately diagnosed and solved the problem.