With tools like Apache, Perl, PHP, and Python, building great MySQL applications is easy. Making sure that they are fast, however, requires quite a bit more insight. Here’s what you need to know.
MySQL has a well-deserved reputation for being a very fast database server that’s also quite easy to set up and use. (We looked at the installation process in the March 2001 issue, available on the Web at http://www.linux-mag.com/2001-03/mysql_01.html.) With its growing popularity as a back-end database for Web sites, its visibility has increased dramatically in the last year or so. But few MySQL users know more than how to create a database and write queries against it. Just as thousands of folks are learning about Unix by experimenting with Linux in their spare time, many are learning about relational database systems by playing with MySQL. Most of these newcomers to MySQL have neither a background in relational database theory nor the time to read the entire MySQL manual.
For this reason, we’ve decided to take a look at some of the methods you can use to tune MySQL for optimal performance. After reading this article, you’ll understand some of the techniques that will help you design your MySQL databases and queries so that your applications are efficient. We will assume you are familiar with MySQL and SQL basics, but will not assume you have extensive knowledge of either.
Store Only the Information You Need
It sounds like common sense, but people often take the “kitchen sink” approach to database design. They think of everything they might possibly want to store and design the database to hold all that data. You need to be realistic about your needs and decide what information is really necessary. Often you can generate some data on the fly without having to store it in a database table. In such cases, it also makes sense to do this from an application development point of view as well.
For example, a product table for an online catalog might contain the names, descriptions, sizes, weights, and prices of various products. In addition to the price, you might want to store the tax and shipping cost associated with each item. But there really is no need to do so. First of all, both the tax and shipping cost can easily be computed on the fly (either by your application or by MySQL). Secondly, if the tax or shipping rates were to change, you would have to write the necessary queries to update the tax and shipping rates in every product record.
Sometimes people think that it is too difficult to add fields to their database tables at a later date, so they feel compelled to define as many columns as possible. Well, that’s simply a misconception. In MySQL you can use the ALTER TABLE command to modify the table definition on the fly to suit your changing needs.
For example, if you suddenly realize that you need to add a rank column to your products table (maybe you want to allow users to rate the products in your catalog), you could do the following:
ALTER TABLE products ADD rank INTEGER
This will add a rank column of type integer to your products table. See the MySQL manual for the full description of what you can do with ALTER TABLE .
Only Ask for What You Need — And Be Explicit
Just like saying “only store what you need,” this may seem like more common sense. But this point is overlooked far more often. Why? Because when an application is in development, the requirements often change, so many of the queries end up looking like:
SELECT * FROM sometable
Asking for all the columns is simply the most convenient thing to do when you are not sure which fields you’ll need. However, as your tables grow and change, it can become a performance problem. It is far better in the end to take the extra time after your initial development is done and decide exactly what you need from your queries. Specify the columns explicitly:
SELECT name, rank, description FROM products
This brings up a related point that has more to do with code maintainability than performance. Most programming languages (Perl, Python, PHP, Java, etc.) allow you to access the results of a query by field name and by numeric position. This means you can access the name field or you can access field 0 and get the same data.
In the long run it is far better to use column names than their numeric positions. Why? Because the relative positions of columns in a table or a query result can change. They may change in a table as the result of repeated use of ALTER TABLE . They will change in a query as the result of someone rewriting the query and forgetting to update the application logic to match.
Of course, you still need to be careful about changing your column names! But if you use column names instead of numeric positions, you can grep the source code or use your editor’s search capability to find the code that you need to change if a column name changes.
Normalize Your Table Structures
If you’ve never heard of “data normalization” before, fear not. While normalization can be a complex topic, you can actually benefit a lot from understanding only the most basic normalization concepts.
One of the easiest ways to understand this is to think of your tables as spreadsheets. If you wanted to keep track of your CD collection in a spreadsheet, you might design something like what you see in Figure One.
album track1 track2 track10
—– —— —— ——-
Billboard Top Hits – 1984 Loverboy Shout St. Elmo’s Fire
(Billy Ocean) (Tears for Fears) (John Parr)
This seems reasonable. Most CDs only have 10 tracks on them, right? Not exactly. What if you own a CD with just over 100 tracks on it and several with more than 20 tracks. That means with this method you’d need to have a really wide spreadsheet (or a table with over 100 fields) to hold all the data in the most extreme cases. That’s not good.
One of the goals of a normalized table structure is to minimize the number of “empty cells.” In the case of the above CD table, you’d have a lot of those if you allowed for CDs that might contain 100 tracks. Whenever you are working with field lists that might expand “off to the right” like this CD list does, it’s a sign that you need to split your data up into two or more tables that you can then access together to get the data you need.
Many people who are new to relational database systems don’t actually know what the “relational” in Relational Database Management System (RDBMS) really means. In simple terms, like groups of information are stored in different tables that can then be “joined” together (related) based on data they have in common. Unfortunately, that sounds rather academic and vague. But the CD database presents a concrete situation in which we can look at how to normalize data.
The realization that every list of CDs has a fixed set of attributes (title, artist, year, genre) and a variable set of attributes (the track list) gives us some idea as to how to break things up into multiple tables that we can then relate to each other.
You can create one table that contains a list of all the albums and their fixed attributes and another that contains a list of all the tracks on those albums. So instead of thinking horizontally (as with the spreadsheet), you think vertically — as if we’re creating lists instead of rows — and set up a table structure like the one in Figure Two.
The album’s id (which MySQL will generate for you automatically because we used the AUTO_INCREMENT attribute on the column) is what relates the various tracks to a given album. The album_id field in the tracks table matches an id in the albums table. So to get a list of all the tracks on a given album, you could use a query like this:
SELECT tracks.num, tracks.name
FROM albums, tracks
WHERE albums.title = ‘Billboard Top Hits – 1984′
AND albums.id = tracks.album_id
This structure is both flexible and efficient. The flexibility comes from the fact that you can add data into the system later without having to rework what you already have. For example, if you wanted to add artist information about each album, you could create an artists table that is related back to the albums table just as the tracks tables already is. You would not need to modify your existing structure at all — only add to it.
The efficiency comes from the fact that there is no significant data duplication and there are not a lot of holes (empty cells) in your data. So MySQL neither has to store more data than necessary nor does it have to spend extra effort looking past lots of empty areas in your table.
The bottom line is that normalizing your data may seem a bit strange if you’re new to relational databases, but it allows MySQL to be very efficient when it comes to storing and retrieving your data and gives you the flexibility to grow and scale your applications without having to restructure your database several times along the way. Take the time to think through database designs as early as possible, and consider how your needs may grow over time. The extra time spent up front is always well spent.
Composite Indexes (sometimes called Compound Indexes) are single indexes that are based on multiple columns. MySQL will only use one index per table when processing a query. This means that if you have multiple columns that often appear together in your WHERE clauses, you might be able to speed up those queries by creating a composite index.
Consider the following table structure fragment:
CREATE TABLE people (
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
favorite_color VARCHAR(10) NOT NULL,
If you often query the people table based on both last_ name and first_name, you might benefit from a composite index on last_name and first_name:
INDEX last_first (last_name, first_name)
Because of the way MySQL builds composite indexes, it can use the last_first index to answer queries based on last_name alone or last_name and first_name. It will not, however, be used in a query on just first_name. This is because MySQL will only use a composite index if the columns involved form a “leftmost prefix” of the composite index.
So if you have a composite index composed of many columns:
INDEX big_index (a, b, c, d, e, f, g, h, i)
MySQL can use it to answer queries based on a , or a and b , or a and b and c , or a and b and c and d . But it cannot use big_index to process a query based on e , or c and f , or g and i because none of those sequences start at the leftmost side of the index.
The bottom line is that composite indexes can often be used to accelerate some complex queries. But you need to understand their limitations and you should always perform some testing instead of simply assuming that such an index will help.
Use Indexes to Speed Up Queries
When MySQL tries to answer a query, it looks at a variety of statistics about your data and decides how to find the data you want in the fastest way. For the query used in the previous section, MySQL will read all the titles in the albums table and compare them with “Billboard Top Hits –1984″ to see if they match. It is not sufficient to stop as soon as it finds a match, as there may be more that one album with the same title (you might have 12 CDs titled “Greatest Hits,” for example). As a result, MySQL must read every row in the table. This is commonly called a “full table scan” and is something to avoid.
You should avoid full table scans for a couple of reasons:
CPU Overhead: The process of checking all those titles is fairly quick if you don’t have many albums. But what if you need to store a lot of albums in your database? The more albums you have, the longer it takes. There’s a linear relationship between the number of albums in the list and the amount of time it takes to check them all.
Concurrency: While MySQL is reading data from a table, it locks the table so that nobody else can write to it, but they can read from it. When MySQL is updating or deleting rows in a table, it locks the table so that nobody else can even read from it.
Disk Overhead: On a large table, a full table scan will consume a lot of disk I/O. This can significantly slow down your database server — especially if your server has slower IDE drives.
It’s best to keep full table scans to an absolute minimum — especially if your application will need to scale in size, number of users, or both. The very latest versions of MySQL do have several concurrency improvements (BDB, InnoDB, and Gemini table types), but they’re beyond the scope of this discussion.
This is where indexing can help. Simply put, an index allows MySQL to quickly determine if any given value, such as “Billboard Top Hits — 1984,” will match any rows in a table.
How does it do that? When you tell MySQL to index a particular column, it creates another data structure (an index) behind the scenes and uses it to store some extra information about the values in the indexed column. (The indexed values are often called keys.) While this is a bit of a simplification, MySQL stores all the keys sorted in a tree data structure. This data tree structure allows MySQL to find particular keys very quickly.
When MySQL finds that there is an index on a column, it will use the index rather than performing a full table scan. This saves CPU time (not having to read all possible values) and disk I/O, and it improves concurrency because MySQL will only lock the table long enough to get the rows it needs (based on what it found in the index). When you have a lot of data in your tables, the resulting improvements can be very significant!
An updated CREATE TABLE statement for the albums table appears in Figure Three.
CREATE TABLE albums (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(80)NOT NULL,
INDEX title_idx (title)
As you can see, that statement simply added an INDEX line to the end of the definition that told MySQL to create an index called title_idx on the title column of the albums table. You can add several indexes to a table just as you can have several columns in your tables. Single indexes may also be composed of multiple columns (see the Composite Indexes sidebar, pg. 35).
To add an index to an existing table rather than recreating the table from scratch, you can use the ALTER TABLE command:
ALTER TABLE albums ADD INDEX title_idx (title)
The rules that MySQL uses for deciding exactly how to fetch data can become difficult to understand if your query is complex. Fortunately, there are a few general rules and a command that allow you to get a better understanding of what it is doing. First, the rules:
MySQL will not use an index if it decides that it would be faster to simply scan the entire table. Generally, if an index tells MySQL to access roughly 30 percent of the rows in a table, it abandon the index and simply perform a full table scan.
If multiple indexes may be used to satisfy a query, MySQL will use the most restrictive one — that is, the one which results in the fewest rows being fetched.
If the columns you are selecting are all part of an index, MySQL may read all the data you need directly from the index and never touch (or lock) the table itself.
When joining several tables, MySQL will read data from the table that is likely to return the fewest rows first. The order in which you specify the tables may not be the same order in which MySQL uses them. This also affects the order in which the rows are ultimately returned to you, so be sure to use an ORDERBY clause in your query if you need the rows in a particular order.
Having said all that, it’s important to realize that some of the decisions MySQL makes are actually based on guesses. And just like humans who make lots of guesses, it occasionally makes a wrong one.
If you suspect that has happened or just want to understand what MySQL is doing to process a query, you can use the EXPLAIN command. Simply add the word EXPLAIN to the beginning of your query and ask MySQL to execute it. Instead of executing the query, MySQL will report back the list of indexes that are candidates for helping the query and what it knows about them.
A full discussion of EXPLAIN ‘s output is beyond the scope of this article, but it is well documented in the MySQL manual. See “How MySQL Uses Indexes” and ” EXPLAIN Syntax” in the Resources sidebar (pg. 38) for more details and examples.
Don’t Overuse Indexes
Given that indexes make queries faster, you might be inclined to index every column in your tables. But the performance boosting benefits of indexes do come at a price. Each time you INSERT , UPDATE , REPLACE , or DELETE a record in your table, MySQL has to update any indexes on the table to reflect the change(s).
So how do you decide when to use them? The most common answer is “it depends.” It depends on what types of queries you run and how often you run them. It depends on your data. It depends on your expectations and requirements. You get the point — it depends on many things.
The reason for having an index on a column is to allow MySQL to narrow its searches for matching rows as quickly as possible (and to avoid the full table scan). You can think of the index as containing one entry for each unique value in the column. In the index, MySQL must account for any duplicate values. These duplicate values slightly decrease the efficiency and usefulness of the index.
So before you index a column, consider what percentage of the entries are duplicates. If that percentage is rather high, you might not see any improvement with an index.
To put this in more concrete terms, it makes a lot of sense to index the title field in the albums table because there is likely to be a lot of variety in the values and very little duplication. But if you had a genre column in the albums table, there might be little value in indexing it. How many different genres does the average CD collection contain? There are bound to be many duplicate genre entries.
Another thing to consider is how often your index might be used. MySQL can only use the index for a particular column if that column appears in the WHERE clause in a query. In attempting to answer the query:
SELECT * FROM albums WHERE id = 500
MySQL cannot use the index on title. The query is asking MySQL to find records based on their id, not their title.
If you rarely use a column in the WHERE clause of your queries, it is probably not worth indexing that column. It is probably more efficient overall to suffer the full table scan in those rare instances than to ask MySQL to keep the index updated each time the data changes.
See the Query Processing sidebar for more information on what MySQL considers when processing a given query.
When in doubt, test. You can always run some benchmarks with and without an index to see which is faster. Just try to make your benchmark realistic. If 20 percent of your queries are UPDATE s and 80 percent are SELECT s, be sure that your benchmarks reflect that.
Benchmarking is a large topic that could probably consume an entire article of it’s own. Until that happens, you can find pointers to the MySQL benchmarking tool “MySQL Super Smack” in the Resources sidebar. You’re also certainly welcome to discuss the topic on the MySQL Users mailing list.
Use Replace Queries
There may be cases in which you want to INSERT a record into a table, unless it is already there. If the record is already there, you want to UPDATE it. Rather than writing the logic to do this in your code and having to run several queries, use MySQL’s REPLACE to do the job.
If the album with an id of 6 was supposed to have the title “Shaking the Tree,” you might write this query:
REPLACE INTO albums VALUES (6, ‘Shaking the Tree’)
It is important to understand how REPLACE determines whether or not a record is already in the table. MySQL will use any PRIMARY KEY or UNIQUE INDEX on the table to perform the check. If neither exist, the REPLACE effectively becomes an INSERT .
Use Temporary Tables (Wisely)
When working with very large tables, you may occasionally need to run many queries against a small subset of a very large amount of data. Rather than run those queries against the whole table(s) and make MySQL find the few records you need each time, it may be faster to select the records into a temporary table and then run your queries against that table.
Creating a temporary table is as easy as adding the word TEMPORARY to a normal CREATE TABLE statement:
CREATE TEMPORARY TABLE tmp_table (
name VARCHAR(10) NOT NULL,
value INTEGER NOT NULL
A temporary table will exist for the duration of your connection to MySQL. When you disconnect, MySQL will automatically remove the table and free up the space it had used. You can, of course, drop the table while you are still connected to free up the space:
DROP TABLE tmp_table
If a table named tmp_table already exists in your database when you create a temporary table named tmp_table , the temporary table will essentially mask (or hide) the non-temporary tmp_table .
MySQL also allows you to specify that a temporary table be created in memory if you declare it a HEAP table:
CREATE TEMPORARY TABLE tmp_table (
name VARCHAR(10) NOT NULL,
value INTEGER NOT NULL
) TYPE = HEAP
Because HEAP tables are stored in memory, queries you run against them may be much faster than on-disk temporary tables. However, HEAP tables are a bit different from normal tables and have some specific limitations of their own. See the MySQL manual for details.
As with previous suggestions, you should test temporary tables to see if they are really faster than just running your queries against a large set of data. If the data is well indexed, temporary tables may not be any faster.
Temporary tables are only available in MySQL 3.23.0 and newer versions.
Use a Recent Version of MySQL
As of mid-January, MySQL 3.23.xx has been declared stable. In addition to having many new features, it is also faster and more scalable than the 3.22 series. In future articles on MySQL, we will take a look at some of the new features and how you might be able to take advantage of them.
Updated releases of MySQL are coming out all the time. If you have control over your MySQL server, it’s best to try to stay relatively current. Aside from having all of the latest features and bug fixes, you’ll often see significant performance increases with newer MySQL releases.
You can learn more about new releases by monitoring Freshmeat or by joining the MySQL Announcements mailing list.
Other Sources of Help
In addition to reading the MySQL manual, you might also consult with:
Your DBA: If you are using MySQL on a shared server, there’s a good chance that someone has been designated as the Database Administrator (DBA). If your DBA is very familiar with MySQL, he can probably lend a hand in analyzing and optimizing your table structures and queries.
The MySQL Users Mailing List: There is a very active mailing list on which MySQL users routinely ask each other for help. Beginners and seasoned users alike share their knowledge and help each other out with common problems. In fact, some members of the MySQL development team monitor the list as well (again, see the Resources sidebar).
The MySQL Book: Not to be confused with the MySQL manual, Paul DuBois has written an excellent book titled MySQL (ISBN: 0735709211, published by New Riders). It is generally regarded as the book to have for MySQL.
Until Next Time
There is a lot more to this topic, but hopefully these ideas and examples have illustrated some of the major points in a way that helps out MySQL newbies and long-time users alike.
Of course, there’s no way that we can cover everything there is to say about a topic as complex as MySQL in just one article. So make sure to keep an eye out for more articles (on topics such as server tuning and gathering performance statistics from MySQL servers) in upcoming issues.