What are a few of the things you really need to keep in mind when starting that next big MySQL project?
About that Caching Layer
It’s no longer that case that the typical site is a vanilla LAMP stack with Linux, Apache, MySQL and your favorite “P” scripting language (Perl, PHP, or Python). Not only a lot more languages out there nowadays (Ruby, for example), but there’s a new layer in the stack: caching. It’s become fairly common to use memcached as a place to store transient data that’s too expensive to fetch from MySQL. Performance is excellent. With gigabit networks connecting 64 bit machines that have anywhere from 16 to 64GB of RAM and no disk involved, you easily get response times measured in the single digit milliseconds.
Memcached isn’t the only came in town. There are numerous Open Source Distributed Hash Table (DHT) projects working to make fast and scalable systems for storing and retrieving arbitrary key/value pairs.
Plan for a caching layer in your application. Instead of writing code that directly queries the database for data that rarely changes or data you expect to re-use frequently, either within or across user sessions, hide that logic behind a function. Doing so provides a layer of separation that makes it far easier to begin making use of a caching layer (since you’ll have a single library of functions to update), even if you don’t actually have memcached or a DHT of some sort running from the start.
OLTP vs. OLAP (or External vs. Internal)
Relational databases are often used in two very different (nearly opposite, in fact) roles: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). If you’re not sure which of those two you need, the answer is probably “both” and that’s part of the problem. Typical “consumer” web applications are OLTP applications from the user’s point of view. Most queries are going to use indexes so they need to examine very few rows in a given table. Depending on the nature of the site, there may be a mix of read and write queries occurring as the user interacts with the site. But at any given time, and for any given user, only a very small slice of the total dataset is used.
However, needs on the back end are often quite different. The people who keep things running behind the scenes often want to ask interesting questions about the aggregate data. That means you end up building reporting tools that attempt to answer those questions without bringing the site to it’s knees. And over time that gets harder and harder because the queries end up needing to look at a substantial portion of the data (if not all of it). The data size continues to increase and the questions get harder to answer.
Eventually you’ll come to realize two things. First, and this doesn’t take long, you need a dedicated machine on which to host this internal reporting database. Having a separate slave means that no matter how gnarly things get, the live site is not impacted. That’s easy. But the larger issue is that traditional OLTP and OLAP database schemas are built differently with different goals in mind.
It’s common to see a very academically correct design for a reporting (OLAP) database. The star schema is a common pattern to follow for data-mining and reporting applications. But on the front side you’re more likely to keep the number of tables small and forego some normalization in favor of keeping queries simple for better performance.
If you anticipate this sort of scenario playing out, it’s best to think about how to truly separate analytics from the low-latency user-facing queries. You may find that the majority of back-end analysis only needs to look at the previous day’s data. In cases like this, it’s pretty straightforward to build a mechanism to extract that necessary data, load it into a database designed for handling those queries.
It sounds obvious, but a little bit extra thinking and planning early on in the life cycle of a database-backed Web application can make your life a lot easier down the road. Give some serious though the role of the database and write it down. Be prepared to say “no” when things begin to creep too far. Make sure you plan for caching the first time you begin coding queries in your app. And don’t let your internal analytics needs dictate the schema or features of the main production database.