
The Curious Case of the Failing Connections

Troubleshooting is often an adventure that takes you to unexpected places and teaches you a lot along the way.

One of the frustrating things about building and running backend infrastructure services is that things break sometimes — often in unexpected ways. And when they do, you can spend untold hours chasing down various ideas, most of which are dead ends, before eventually solving the problem. The goal in the end, as always, is to have a more stable and resilient service that will serve your users without getting you out of bed in the middle of the night.

What’s often interesting about these problems is not necessarily the solution — it’s what you learn along the way. If you’re not debugging and testing alone, you’ve likely got a small team of folks helping you out. And in that process you’ll discover things you’d never encountered before, including hardware/software interactions that you’d previously never considered.

It can be incredibly educational, if not frustrating and tiresome.

This is the story of one such problem and our attempts to solve it.

Background

If you’ve read much of my previous writing, you may know that one of the services I work on at Craigslist is search. We have a few clusters of Sphinx servers set up in a master/slave configuration. The masters are responsible for incrementally updating search indexes and making them available for the slaves to pull via rsync. They do this by periodically querying a slave of our master classifieds MySQL database to pull lists of all the changes since the last incremental build. The code first fetches a list of all deleted document ids so it can mark them as deleted in the per-city indexes. Then it asks for a list of all other new and changed documents so they can be indexed. The resulting index is then merged into an active index.
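
By way of illustration, a minimal sketch of that incremental fetch using Perl’s DBI might look like the following. The host, credentials, and the postings table and its columns are invented for the example; the real schema and queries aren’t shown here.

use strict;
use warnings;
use DBI;

# Connect to a slave of the classifieds database (details invented).
my $dbh = DBI->connect(
    'DBI:mysql:database=classifieds;host=db-slave.example.com',
    'sphinx_build', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);

# Timestamp of the previous incremental build, tracked elsewhere.
my $last_build_time = '2009-06-01 00:00:00';

# First, ids of documents deleted since the last build, so they can be
# marked deleted in the per-city indexes.
my $deleted_ids = $dbh->selectcol_arrayref(
    'SELECT id FROM postings WHERE deleted = 1 AND modified_at > ?',
    undef, $last_build_time,
);

# Then all other new and changed documents, to be (re)indexed.
my $changed_docs = $dbh->selectall_arrayref(
    'SELECT id, city, title, body FROM postings
      WHERE deleted = 0 AND modified_at > ?',
    { Slice => {} }, $last_build_time,
);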

My code takes those documents, transforms them into XML for Sphinx, and runs several parallel index builds and index merges to produce the next index snapshot that the slaves are looking for. Once that’s done, the new indexes are advertised to the slaves so they can rsync the changes during their next check for updated indexes, run via cron.
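
The build-and-merge step can be outlined roughly as below: per-city delta indexes built in parallel, then merged into the active indexes with Sphinx’s indexer --merge and rotated live. The config path, index names, and city list are invented; this is a sketch of the flow, not the actual build code.

use strict;
use warnings;

my $conf   = '/etc/sphinx/sphinx.conf';       # hypothetical config path
my @cities = qw(sfbay newyork seattle);       # hypothetical city list

# Build the per-city delta indexes in parallel.
my @pids;
for my $city (@cities) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec('indexer', '--config', $conf, "${city}_delta")
            or die "exec failed: $!";
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# Merge each delta into the city's active index and rotate it live.
for my $city (@cities) {
    system('indexer', '--config', $conf, '--merge',
           "${city}_main", "${city}_delta", '--rotate') == 0
        or warn "merge failed for $city\n";
}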

Legacy

That’s all pretty straightforward. We’d been running this for many months for our production search clusters in our main data center without issue. But we recently went through the exercise of bringing up Sphinx clusters in our newest data center. As part of this process, we were making two notable but seemingly innocuous changes. First, we were moving to a new hardware platform that consists primarily of blade servers from IBM. They’d performed well in our evaluation last year and appeared to run our software well.

The second change was moving from our older OpenSUSE 10.2 to OpenSUSE 11.1. Again, we’d done a fair amount of testing in our development environment and found that the change was pretty transparent, aside from needing to put a few more packages on our default installation list.

The Problem

For a few weeks, things appeared to be running just fine. The blades performed well and we saw no unusual errors. My brief load testing indicated that the 3GHz CPUs could handle many hundreds of queries per second, easily matching our previous infrastructure. But when we enabled alerting on our monitoring system, we realized that something was amiss. We were seeing periodic alerts about “stale” Sphinx indexes on both of the newly configured clusters.
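
The alert itself boils down to a freshness check on the index files. A minimal sketch of that sort of check, assuming a hypothetical index path and a 30-minute threshold (the real monitoring check isn’t shown here), could be:

use strict;
use warnings;

my $index   = '/var/lib/sphinx/sfbay_main.spd';  # hypothetical index file
my $max_age = 30 * 60;                           # 30 minutes, assumed

die "no such index file: $index\n" unless -e $index;
my $age = time() - (stat($index))[9];            # seconds since last mtime

if ($age > $max_age) {
    print "CRITICAL: $index is ${age}s old (stale)\n";
    exit 2;                                      # Nagios-style critical
}
print "OK: $index is ${age}s old\n";
exit 0;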

Hmm. Upon digging into it a bit, I realized a few things and narrowed the search. These weren’t index build failures. The index building and merge process was working fine. So there wasn’t a problem running binaries I’d built on 10.2 on the 11.1 boxes. And it wasn’t a problem with the slaves copying indexes from the masters. The rsync processes fired as expected and worked just fine. It’s just that the master’s indexes weren’t always as “fresh” as we’d hoped for. That allowed me to narrow the focus of the troubleshooting.

The Suspects

Digging deeper into the logs, I found out what was happening. The Perl code responsible for connecting to the slave database and pulling the deleted, changed, and new documents was occasionally failing in an odd way — it was losing its connection to MySQL and logging an error message that looked like this:

Lost connection to MySQL server at 'reading authorization packet', system error: 0

Well that’s just weird. Why would that happen?

A bit of searching around revealed an entry from the public MySQL bug database: Bug #28359: Intermittent lost connection at ‘reading authorization packet’ errors. Reading through that bug was very enlightening and gave us some new insight into how the MySQL client and server negotiate a new connection and the “handshake” process they go through.
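
Much of the discussion there centers on the server-side connect_timeout setting, which bounds how long the server will wait for the connection handshake to finish. For reference, here’s a small sketch of checking that value from Perl, and raising it if you want to give slow handshakes more headroom; the host, credentials, and the 20-second figure are just examples, not a recommendation.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:host=db-slave.example.com',
                       'admin_user', 'secret', { RaiseError => 1 });

# See how long the server will wait for a connection handshake.
my (undef, $timeout) = $dbh->selectrow_array(
    "SHOW GLOBAL VARIABLES LIKE 'connect_timeout'");
print "connect_timeout is $timeout seconds\n";

# Give slow handshakes more headroom (needs the SUPER privilege).
$dbh->do('SET GLOBAL connect_timeout = 20') if $timeout < 20;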

One thing that bug report suggested was checking the values of two global status counters in MySQL: Aborted_clients and Aborted_connects.

mysql> show status like 'abort%';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| Aborted_clients  | 199461 |
| Aborted_connects | 59490  |
+------------------+--------+
2 rows in set (0.05 sec)

Sure enough, our counters were increasing too.
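
One easy way to see whether those counters move in step with the connection failures is to watch them over time. Here’s a rough sketch of that kind of polling loop; the host, credentials, and one-minute interval are invented for the example.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:host=db-slave.example.com',
                       'monitor', 'secret', { RaiseError => 1 });

my %prev;
while (1) {
    # Fetch the current values of the Aborted_* counters.
    my $rows = $dbh->selectall_arrayref(
        "SHOW GLOBAL STATUS LIKE 'Aborted%'");
    for my $row (@$rows) {
        my ($name, $value) = @$row;
        my $delta = exists $prev{$name} ? $value - $prev{$name} : 0;
        printf "%s %s=%d (+%d)\n", scalar(localtime), $name, $value, $delta;
        $prev{$name} = $value;
    }
    sleep 60;
}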

All of this led us to suspect the network. The networking team always loves it when we do that. But it seemed as though the messages from the MySQL server weren’t getting to the Sphinx master in a timely fashion.

They checked for errors on the switch ports that the Sphinx and MySQL servers were plugged into. Nothing unusual showed up. They contacted our switch vendor for advice. In the meantime, we started looking more closely at the operating system side to see if anything else would suggest a source of the trouble. I tried doing simple ping tests from servers in the Sphinx cluster to servers in the MySQL cluster. And, much to my surprise, we did see a pattern repeat itself. Under normal circumstances, the ping times were in the range of 0.2 to 0.4 milliseconds. In other words, they were completely normal.

But during those intervals when the MySQL connections would fail, ping times would jump into the 40 millisecond range. Seeing times increase by a factor of 100 suggested that we were definitely looking at some kind of networking problem. We pressed our switch vendor for more ideas and assistance.
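
For the curious, that kind of ping monitoring doesn’t need anything fancier than a small loop like the sketch below. The target host and the spike threshold are made up, but timestamping the latency jumps this way makes it easy to line them up against the MySQL errors.

use strict;
use warnings;

my $target    = 'db-slave.example.com';   # hypothetical MySQL slave host
my $threshold = 5;                        # ms; normal was 0.2-0.4 ms

while (1) {
    # One ping per iteration, with a one-second timeout.
    my $out = `ping -c 1 -W 1 $target 2>&1`;
    if ($out =~ /time=([\d.]+) ms/) {
        my $ms = $1;
        printf "%s %s %.2f ms%s\n", scalar(localtime), $target, $ms,
               $ms > $threshold ? '  <-- spike' : '';
    }
    else {
        print scalar(localtime), " $target: ping failed\n";
    }
    sleep 5;
}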

To Be Continued…

Check back in next week to see what we tried next and how we ultimately diagnosed and solved the problem.

Comments on "The Curious Case of the Failing Connections"

joseamable

I would like to suggest in general for all articles that the “to be continued…” is not something very appropriate for articles. Although it allows the splitting of the article into consumable portions, by the time the next article is available, I will have forgotten all that I read in this particular portion, forcing me to re-read again.

Taking time out from our professional life to read an article and find it incomplete leaves a very bitter taste.

am4computers

I totally agree with joseamable.

spremolla

This is like a soap opera, “to be continued” at the most thrilling part, hehehe

uris

I totally agree. It sounds like a Hannah Montana article. This article is obviously targeted at sys admins, so it should be treated as a sys admin article.

netmatrix

Will Batman and Robin escape the diabolical clutches of the Joker? Tune in next week, same bat time, same bat channel.

duncan

I totally agree with joseamable.

darrencook

This was an enjoyable, interesting read; I think that is why having it cut off in the middle has irritated so many of us.
(Two part articles that were planned as such, and where each part has an introduction, a conclusion, and its own lessons, are fine; but this was obviously intended as a one-part article and has simply been chopped in half.)

wmmlj

Ditto all above – there’s only 1 page, clearly no reason to chop the article.
Please don’t do this.

haroldmodesto

Ditto, ditto, ditto.

joseamable said:

“Although it allows the splitting of the article into consumable portions, by the time the next article is available, I will have forgotten all that I read in this particular portion, forcing me to re-read again.”

And in my case, I will have altogether forgotten about the article. There are days when I follow the links that interest me on the newsletter, and there are days when articles that should have interested me are ignored (due to various reasons).

argon99

It’s a lack of respect for the readers by Linux Magazine that they chopped this article in half. We are being treated like sheep. They see us as mindless drones that will just obey what they command without complaint. They think they can string us along so that we read the next issue if they take an important article and cut out the part that we are all interested in.

Maybe they should have secret messages in the magazine. Then we should send in two dollars to buy the decoder ring. We can then find out that the message says “READ LINUX MAGAZINE.”

For myself, I am in a boolean state. Should I read Linux Magazine or should I not read Linux Magazine? Let me think (they treat me like a mindless sheep, what should I do?). Well, I know I make mistakes, so I will let them have another chance. But if I am treated like a mindless sheep again, I will have to vote with my eyes and never read Linux Magazine again.

dushyanthh

Totally agree – very disappointing not to see the end
