When dealing with large distributed systems, knowing some basic performance and failure numbers helps you understand what you can reasonably expect both in terms of performance and reliability.
Recently, Google Fellow Jeff Dean gave a presentation at the LADIS 2009 (Large Scale Distributed Systems and Middleware) Conference. As with some of his previous talks, this one did an excellent job of revealing a bit more about how Google builds systems and what they’ve learned in doing so.
Working at the scale they do (thousands and thousands of servers) is especially interesting because you have to really step back and change your thinking about the available building blocks and how they can best fit together to get the job done. And to do so it helps to have a good high level mental model of the pieces and their performance. In Google’s case, this means having 40-80 servers (4-8 cores, 16GB RAM, 2TB disk) in a rack all connected via a gigabit ethernet switch. Multiple (30+) racks then uplink to a another switch to form a cluster. And services are designed to run across multiple clusters in different data centers around the world.
Design for Failure
Jeff discussed their experience with a typical cluster in its first year and presented some sobering downtime and failure statistics. Here’s some of what he presented (all numbers approximate):
- 1 PDU failure causing 500-1,000 machines to suddenly vanish for 6 hours
- 1 network re-wiring project which entails a rolling 5% of machines going offline in a 2 day period
- 20 rack failures, 40-80 machines gone for 1-6 hours
- 5 racks “go wonky” meaning the start seeing up to 50% packet loss
- 8 network maintenance windows, half of which results in 30 minutes of random connectivity loss
- 12 router reloads which kill DNS and forward-facing VIPs for a few minutes
- 3 router failures that take down some traffic for an hour or so
- dozens of 30 second “blips” for DNS
- 1,000 individual machine failures
- thousands of hard drive failures
- slow disks, bad memory, misconfigured or flaky machines, etc.
Now I don’t work at anything near that scale, but I can tell you that I think those numbers scale down pretty well. If you only have 1/4 of a “google cluster” in your infrastructure, you probably will see about 25% of those ballpark numbers in your own experience. Once you have more than a few dozen systems involved, things just seem to break all the time!
Based on my experience, working in an environment like that can be downright maddening–especially when you network is acting up. You quickly learn that the keys to your sanity are redundancy and removing as many Single Points of Failure (SPoF) as possible. See The Curious Case of the Failing Connections and The Curious Case of the Failing Connections, Part 2 for an example of unexpected performance problems.
Relaly, all of this argues for building distributed systems that are as independent as possible and can tolerate failures, even if means giving users partial functionality (because it’s better than no functionality). But figuring out how t design and build scalable distributed systems isn’t easy either. Your first attempt or two may not be quite right. There’s a good chance you’ll underestimate the cost of some operation or another.
Numbers Everyone Should Know
With that in mind, Jeff presented a chart of “Numbers Everyone Should Know” that helps put various operations into perspective.
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
The idea is to facilitate back of the envelope estimates of system performance. He argues that a critical skill is the ability to estimate the performance of a system without having to actually build it. And that’s where those numbers come into play.
To make talking about these easier, let’s pick a few familiar points of reference. If you spend any time playing with
ping or looking at specs for hard disks, you’re probably used to thinking in milliseconds rather than nanoseconds. A millisecond is 1,000,000 nanoseconds. So the “round trip within same datacenter” timing above is 0.5ms, which sounds about right to me.
The exact numbers here aren’t as important as the differences in magnitude as you move up and down the list.
Let’s walk through an example: deciding between a distributed in-memory key/value store (like Redis) or a disk-based system like Berkeley DB is something you can quantify.
If we assume that the data set is sufficiently larger than the available RAM on your server, then most Berkeley DB accesses will have to hit disk and probably result in a few seeks (we’ll ignore the details of hash table vs. B-Tree for now). Let’s call that 3 seeks per request, which is 10,000,000ns (or 10ms) each for a total of 30ms.
However, if you’re using a distributed in-memory key/value store, you need to fetch data from another machine in your cluster. Let’s call that a few 500,000ns (or 0.5ms) round-trips in the same datacenter (assuming persistent network connections and a simple request/response protocol) and the same 3 “seeks” as before, but this time they’re 100ns main memory references on the remote machine. Now we’re looking at a total more like 2-3ms.
Using this estimation technique, you can see that the distributed in-memory key/value store is easily 10 times faster than using a disk-based solution.
Of course, there’s a critical piece of information missing from these numbers: cost. And that cost comes in two flavors. The first is the cost associated with buying enough machines to keep all the data in memory and the network gear that allows them all to talk to each other. The second “cost” is the complexity associated with having 10 servers instead of 1 server. As we saw earlier, when the number of servers grows, the odds of failures rise as well.
So you need to decide how you needs map onto the spectrum of performance and cost/complexity options. If it’s a requirement to serve N requests per second, then you can scope out what that will take given various amounts of data.
It’s fun to design big complicated systems, but it’s no fun at all to support them when they’re not performing or are too fragile because they break all the time. By knowing some basic rules of thumb in advance, you can design with the right expectations in mind and spend a lot less time scratching your head (or pulling your hair out).
Are there are performance or failure estimation tips you regularly use? What are they?