MySQL Clustering

This month and next, we'll look at the most significant addition to MySQL 4.1: native clustering. This month, let's start with an overview of the new clustering technology, see how it's been integrated into MySQL, and understand the benefits it provides. Next month, we'll cover the steps necessary to get a cluster up and running.

This month and next, we’ll look at the most significant addition to MySQL 4.1: native clustering. This month, let’s start with an overview of the new clustering technology, see how it’s been integrated into MySQL, and understand the benefits it provides. Next month, we’ll cover the steps necessary to get a cluster up and running.

Introducing NDB

Last year, MySQL AB (the company the builds and supports MySQL) acquired a small company named Alzato, which was part of Ericsson. Alzato’s main product was a highly-available database platform known as NDB. Ericsson, being in the telecommunications industry, had a keen interest in being able to produce a very fast, low-latency, and highly-available database platform. MySQL AB and the folks from Alzato then began work on integrating NDB into MySQL.

NDB is efficient by design: It’s a transactional database engine with row-level locking that isn’t burdened by a mile-long list of features. It does a few things and does them very, very quickly.

Unlike most of MySQL’s storage engines, NDB is a main memory database engine, which means that it stores all data in RAM and uses disks only for logging. By keeping disks mostly out of the loop, NDB is able to locate and retrieve data with very low latencies. NDB achieves high availability (and a good degree of scalability) through clustering.

What is Clustering?

Simply put, clustering is the act of logically combining multiple servers to achieve increased performance, redundancy, and scale. If one of the nodes in the cluster fails, one or more of the remaining nodes assume the responsibilities of the failed node until it comes back on-line. This is completely transparent to the client application. From the client application’s point of view, a database cluster is a single entity.

While there a several popular techniques for doing this (especially in the database world), NDB uses a shared nothing clustering implementation. That means each node in the cluster is completely independent. There are no physically shared components, such as a NAS device. Again, a failure of one node doesn’t affect other nodes, aside from making more work for them.

Furthermore, NDB implements a hash-based partitioning system that aims to distribute data evenly among all the node groups in the cluster. The result is that each node receives and is responsible for only a subset of the entire data set. However, each piece of data (each row) exists on at least two nodes.

NDB with MySQL

Given all the new features that NDB brings to the table, you may wonder how it’s being integrated with MySQL. The answer is simple: just like every other database engine in MySQL, NDB takes the form of a new storage engine. However, unlike BDB or InnoDB, the NDBCluster storage engine doesn’t actually store data on the MySQL server. Instead it acts as a conduit that connects a MySQL server to an NDB cluster. Figure One illustrates this.

Figure One: The NDBCluster engine connects the MySQL server to an NDB cluster

Put another way, if you create a table of type NDBCluster, it doesn’t actually exist on the MySQL server. The .frm file is there, of course, but the table is actually created on the nodes in the cluster. The NDBCluster storage engine handles all the communication between MySQL and the NDB cluster. The cluster nodes communicate with each other.

When it comes time to get data out of MySQL (and out of the cluster), the division of labor isn’t any different than with a more traditional storage engine, such as MyISAM or InnoDB. In the case of a simple, one-table SELECT query, MySQL parses the query and then figures out which index it can use to most efficiently retrieve the matching record(s).

MySQL then makes requests to the appropriate storage engine (NDBCluster, in this case), asking for all of the records that match the selection criteria on the indexed column(s). Once it’s retrieved the records, it performs any additional filtering or transformations before returning the rows to the client.

To put this in more concrete terms, say you have a phone_book NDB table where field last_name is indexed. Consider how MySQL runs the following query:

SELECT phone_number FROM phone_book WHERE
last_name = ‘smith’ AND first_name = ‘henry’

Here, MySQL asks the NDB cluster for all rows where the last_name is smith because that’s the indexed column. The cluster nodes return all those rows back to the MySQL server, which then examines the first_name column to discard any that don’t match henry.

If there are many smith records in the table, it’s pretty clear that this isn’t the optimal arrangement. While there may be many cluster nodes with ample CPU power to perform the “henry” comparison on their own, they never bother. Instead they send all the candidate records to a single MySQL server, which must then do that work.

NDB has facilities for the caller (MySQL in this case) to provide some filtering information to the cluster nodes. This gives them the ability to do a bit more work and weed out records that were never wanted in the first place. Currently, MySQL doesn’t make use of this feature in NDB, but that’s expected to change.

Redundancy and Replication

Looking at Figure One, you may have noticed that there’s still a single point of failure: the MySQL server itself! Without correcting that glaring omission, you’re still at the mercy of a single server. Luckily, this problem is easy to solve. Simply add a second MySQL server and configure it to communicate with the cluster, too.

Okay, but what about replication? It turns out that MySQL’s existing, log-based, asynchronous replication system works with the NDB storage engine. So if you have a MySQL server and NDB cluster running in one data center, it’s possible to use the binary logs from that MySQL server to replicate all the data into a MySQL server in another data center. That second MySQL server may or may not have another full-blown NDB cluster behind it.

However, if you’re aiming for high availability and have added a second MySQL server in front of the cluster, you’ll need to ensure that all writes happen on only one of the two MySQL servers. The “write” server’s binary log will be used by the slave in the second data center. And since a slave may have only one master, it cannot pull queries from both MySQL servers.

NDB also has its own synchronous replication technology that can be used to keep a backup cluster in sync with the primary cluster. However, this technology has not yet been exposed as part of the MySQL integration.

The addition of yet another storage engine (“YASE”?) means more things to consider when building a new MySQL-based application. Though NDB is not new technology, the integration with MySQL is new. As a result, some features that you might wish for in NDB tables don’t exist yet. Full-text search, for example, is still MyISAM-specific.

With some basic theory out of the way, we’ll get down to business next month by configuring a cluster and running a simple benchmark.

Contact Jeremy Zawodny at Jeremy@Zawodny.com.

Comments are closed.