Out of the box, MySQL isn't exactly tuned for resilience on a busy network where things occasionally go haywire.
A few weeks ago we ran into another set of a problems that pointed at suboptimal default settings in MySQL. Frustrated, I realized that I had a good collection of these problems and decided to rant about them on my blog. Having had a bit of time to think about what I said and talk to some other folks about it, I feel like it’s worth expanding what I wrote and sharing it with a wider audience.
Oh, you may be wondering where the third part of my CouchDB series is. A combination of things conspired against me this week, not the least of which seems to be the issue known as COUCHDB-345 in the bug tracking system used by the Apache Software Foundation. I’m still trying to work out the root of the problem(s) I’m seeing. Hopefully I will return to that soon–things were just starting to get interesting.
Before I get into the nitty gritty details, I want to be clear that the changes I’m suggesting here are meant to be used in what I consider typical high-volume environments. That means a number of MySQL servers using master/slave replication, multiple busy web servers talking to them, and both read and write queries happening at the rate of dozens (or more) per second.
This doesn’t apply to applications or sites that have mostly static data or see very low usage most of the time. In fact, some of these could be troublesome in such environments.
The main motivation behind these changes is to make MySQL more tolerant of transient network problems–the kind that can interrupt replication, break connections between clients and servers, cause DNS resolution to become slow or fail entirely, or cause other types of mayhem that impact MySQL or you applications.
In other words, I see this as an exercise in hardening MySQL and helping to ensure that its built in recovery and abuse mitigation mechanisms don’t work against you. Caveat lector.
There is no heartbeat mechanism that helps slaves to know that their connection to a master hasn’t vanished. What can happen is a network connection gets broken between the two in a way that neither detects. This is typically a firewall/router issue or something that neither host initiates or sees, so neither one is able to send (or receive) a TCP packet that would normally begin to shut down the connection. This is especially true of replication topologies that involve crossing significant distances where multiple networks and providers may be involved.
MySQL uses a simple timeout mechanism to detect this hopefully rare occurrence. If the slave I/O thread has not seen anything from the master in slave_net_timeout seconds, it will connect and then attempt to reconnect and continue replicating. That mechanism works very well and allows slaves to deal with the occasional network glitch.
Unfortunately the default value for this variable is 3600. That’s a full hour of time that passes before the slave decides to give up and try starting with a new connection. So not only do you run the risk of a slave being nearly an hour behind on replication, you may find that this is trickier to detect than you might think!
Why? Because it’s common to use the output of
SHOW SLAVE STATUS to check the health of replication. The field Seconds_Behind_Master generally gives you a sense of whether or not the slave is keeping up. But, that’s not entirely correct. What that value actually tells you is how far behind the master the slave THINKS it is. If it hasn’t seen anything new from the master in 15 minutes and the SQL thread has executed all the queries that the I/O thread wrote to the relay log, it will happily report that it’s 0 seconds behind. Ignorance is bliss.
So it’s clear watching Seconds_Behind_Master is not the best way to monitor the replication health of your slaves. But that’s secondary to the real problem here. I recommend setting slave_net_timeout far lower than the default. Personally, I think that anything higher than 20 seconds is asking for trouble.
Of course this all depends on the nature of your application and how sensitive it may be to slightly out of date slaves. But if you’re expecting writes on the master to happen every second or two, there’s no reason for such a high timeout on the slaves. You’re just asking for trouble some day.
Out of the box, MySQL will perform a reverse DNS lookup on the IP address of any new incoming connection to get the host name from which the connection originated. It then uses that information internally when consulting the various privilege tables to make sure the user connecting is allowed to connect and perform whatever actions he or she may try to perform.
But, believe it or not, DNS is flaky at times. It’s nothing really inherent to DNS itself, but this is an external dependency that doesn’t need to be there in the first place. A simple dropped packed or mis-configured DNS server can mean MySQL connection being denied or delayed for reasons that may not be immediately obvious. A denial of service (DoS) attack against your DNS server could be sufficient to bring new database connections to a standstill.
I recommend adding
skip-name-resolve to your /etc/my.cnf file and updating all your
GRANT statements to use IP addresses instead of DNS names. This simple change can pay dividends for years to come.
connect_timeout and max_connect_errors
MySQL has a limit to how many clients can be connected at one time: max_connections. So one way to DoS a MySQL server is open up many connections but communicate with the server very slowly–especially during the initial authorization handshake. In doing so you use up all the available connections and lock out anyone else from getting in.
To mitigate this, MySQL has a timeout on establishing new connections: connect_timeout. By default this is set to 5 seconds. And normally that’s a perfect fine value. However, if you have a situation where connections simply take longer–say packet loss on your network or DNS servers timing out on that reverse lookup, you can end up in a situation where MySQL is closing connections on perfectly good clients. Those clients are likely to try to reconnect in short order and that almost makes the problem worse.
When this happens, MySQL internally increments a per-host counter it uses to track how many “bad connections” it has seen. When this counter exceeds the value of max_connect_errors MySQL will block the host from connecting again until you issue a
FLUSH HOSTS command. By increasing the connect_timeout a bit you can avert this type of situation.
It’s possible to set max_connect_errors to a sufficiently large value that you may never encounter the situation described above. A number as large as 1,844,674,407,370,954,751 is a pretty high bar, yet is perfect valid if you’re using a 64-bit version of MySQL (and you really should be!).
Sadly, setting the variable to 0 does not disable the check or the lockout behavior described above, so you have to resort to games like this. The good news is that you can make the change without restarting your servers.
Doing the Deed
Before you start making changes, it’s probably worth looking over the list of Server System Variables in MySQL. Make sure to check the documentation that matches the version of MySQL you’re running. Things changed between 4.1, 5.0, 5.1, and up to 5.4. There may be other problematic settings you want to correct at the same time.
There’s a table on that page that tells you if each variable is “dynamic” and if it can be set in the an option file (meaning /etc/my.cnf for our purposes). The dynamic values are easy because you can change them on the fly using a
SET GLOBAL variable = value command. However, you have to make sure to also make the corresponding change in the configuration file so that the server doesn’t revert back to its old behavior after a reboot.
Any non-dynamic values can only be changed in the configuration file and will require a restart of MySQL to take effect.
I hope this has given you some incentive to look at your configuration settings for MySQL and make sure you understand more of the defaults and how they could affect your systems. I find that it’s a good idea to scan through a listing of the variables now and then just to make sure the settings are still sensible.
Have you found yourself needing to make surprising adjustments to some of MySQL’s settings?