
Data Reduction, Part 1

Roughly a year ago, we spent two months looking at logging web hits in MySQL, using Apache and mod_log_sql. In the October 2002 issue (available online at http://www.linux-mag.com/2002-10/lamp_01.html), we looked at what Chris Powell’s mod_log_sql does for you and tried a basic configuration. (After that article appeared, Chris released a new version that fixed a few bugs we discovered in the process of writing that article. Consider upgrading if you haven’t already.) Then in November 2002 (http://www.linux-mag.com/2002-11/lamp_01.html), we started building a basic web interface in PHP to present a view of the logged data. Using that framework, you could construct pages to list the most popular URIs, referers, and so on — all in real-time. That, after all, is part of the beauty of mod_log_sql. You get the benefits of an SQL interface without any unnecessary delays.

At least that was the theory.

One thing we neglected to discuss or consider was the long-term consequences of using mod_log_sql. What happens as you accumulate more and more data? Sure, we mentioned the possibility of adding indexes to make some queries faster, but indexes only help so much. The real problem is that the default logging format for mod_log_sql is very simple, so as to be efficient for real-time logging. The tradeoff, however, is that it’s not space efficient. As time goes on, the data can become quite large, difficult to manage, and slow to query.

This month and next, let’s look at a recently implemented solution to this problem — one that should provide much better long-term storage for recording web traffic.

How Bad Was It?

After collecting roughly two years' worth of traffic data, representing just under ten million hits, the database had grown to 1.8 GB, and the index file alone was nearly 400 MB. The poor server running all of this had only 512 MB of RAM and ran numerous other services in addition to MySQL and Apache: Exim for email, ssh, SpamAssassin, Courier IMAP, BIND for DNS, and so on. And there were about 20 other sites hosted on the same server, all using mod_log_sql.

As a result, RAM was at a premium, and a lot of disk I/O was involved any time MySQL had to examine even a small percentage of the records. Queries weren't as fast as they really should have been, and I/O-intensive queries would bring the whole system to a crawl, because the underpowered 1U server had only one logical disk (two physical disks in a RAID-1 array). Clearly, the storage format had to be re-evaluated.

As a quick refresher, let’s look at the default table structure used by mod_log_sql. The table is shown in Figure One.




Figure One: mod_log_sql table structure

+------------------+----------------------+------+-----+---------+-------+
| Field            | Type                 | Null | Key | Default | Extra |
+------------------+----------------------+------+-----+---------+-------+
| agent            | varchar(255)         | YES  |     | NULL    |       |
| bytes_sent       | int(10) unsigned     | YES  |     | NULL    |       |
| cookie           | varchar(255)         | YES  |     | NULL    |       |
| request_file     | varchar(255)         | YES  |     | NULL    |       |
| referer          | varchar(255)         | YES  |     | NULL    |       |
| remote_host      | varchar(50)          | YES  | MUL | NULL    |       |
| remote_logname   | varchar(50)          | YES  |     | NULL    |       |
| remote_user      | varchar(50)          | YES  |     | NULL    |       |
| request_duration | smallint(5) unsigned | YES  |     | NULL    |       |
| request_line     | varchar(255)         | YES  |     | NULL    |       |
| request_method   | varchar(6)           | YES  |     | NULL    |       |
| request_protocol | varchar(10)          | YES  |     | NULL    |       |
| request_time     | varchar(28)          | YES  |     | NULL    |       |
| request_uri      | varchar(255)         | YES  | MUL | NULL    |       |
| server_port      | smallint(5) unsigned | YES  |     | NULL    |       |
| ssl_cipher       | varchar(25)          | YES  |     | NULL    |       |
| ssl_keysize      | smallint(5) unsigned | YES  |     | NULL    |       |
| ssl_maxkeysize   | smallint(5) unsigned | YES  |     | NULL    |       |
| status           | smallint(5) unsigned | YES  |     | NULL    |       |
| time_stamp       | int(10) unsigned     | YES  | MUL | NULL    |       |
| virtual_host     | varchar(50)          | YES  |     | NULL    |       |
+------------------+----------------------+------+-----+---------+-------+

The only non-default customization you see in Figure One is the addition of a few indexes. The remote_host, request_uri, and time_stamp columns were all indexed so that queries could be constrained by time, IP address, or URI.
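If you're starting from a stock mod_log_sql table, a single ALTER TABLE along these lines adds the same indexes (the table name here is the live logging table discussed below; adjust it to match your setup):

ALTER TABLE access_jeremy_zawodny_com
  ADD INDEX (remote_host),
  ADD INDEX (request_uri),
  ADD INDEX (time_stamp);

Keep in mind that building indexes on a table this size takes a while and locks the table for the duration.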

The main problem with this table structure is the huge amount of duplicated data it stores. To solve this, we decided to break the data up into multiple tables and dramatically reduce the space required. The goal was to end up using a mere ten percent of the original 2.2 GB of data and indexes without sacrificing many features.

Normalization Needed

The logical pieces of data to break out are those that require a lot of space and have a relatively finite number of distinct values. Given a finite number of values, there's bound to be lots of repetition. In this table, the easy candidates were agent, request_uri, and referer.
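Before restructuring anything, it's easy to get a feel for just how much repetition is in there. A quick (if slow-running) query along these lines compares total rows to distinct values; expect the distinct counts to be tiny fractions of the total:

SELECT COUNT(*)                    AS total_hits,
       COUNT(DISTINCT agent)       AS distinct_agents,
       COUNT(DISTINCT request_uri) AS distinct_uris,
       COUNT(DISTINCT referer)     AS distinct_referers
  FROM access_jeremy_zawodny_com;

On a multi-gigabyte table that's a full scan, so save it for off-hours.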

However, hacking up the mod_log_sql code to do that normalization itself would make it quite a bit more complex and result in greater CPU or memory usage (or both, depending on the implementation). Instead, we opted for a hybrid approach. We chose to keep the access_jeremy_zawodny_com table in place without any alterations; it stores only the most recent data (1-2 days). All older data is migrated to a set of much more compact tables: agent, hit, referer, and uri. These tables live in a separate database conveniently named jeremy_zawodny_com. To create the new tables, we used the SQL script shown in Listing One.




Listing One: SQL to create mod_log_sql archive

CREATE DATABASE jeremy_zawodny_com;
USE jeremy_zawodny_com;
DROP TABLE IF EXISTS agent;
DROP TABLE IF EXISTS hit;
DROP TABLE IF EXISTS referer;
DROP TABLE IF EXISTS uri;

CREATE TABLE agent (
  id   INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE (name)
) TYPE = MyISAM;

CREATE TABLE referer (
  id   INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE (name)
) TYPE = MyISAM;

CREATE TABLE uri (
  id   INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE (name)
) TYPE = MyISAM;

CREATE TABLE hit (
  time_stamp INTEGER UNSIGNED NOT NULL,
  host       CHAR(15)         NOT NULL,
  status     SMALLINT         NOT NULL,
  bytes_sent INTEGER UNSIGNED NOT NULL,
  agent_id   INTEGER UNSIGNED NOT NULL,
  referer_id INTEGER UNSIGNED NOT NULL,
  uri_id     INTEGER UNSIGNED NOT NULL
) TYPE = MyISAM;

As you can see, the uri, agent, and referer tables share a common structure. Each is a simple lookup table with an auto-increment id that maps to the larger name value. There's a unique index on the name column to prevent duplicates and to facilitate fast lookups by name.
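Just to make the lookup-table idea concrete (the actual migration script is next month's topic), the insert-or-look-up pattern for a single user agent string looks something like this; the agent string is just a placeholder value:

INSERT IGNORE INTO agent (name)
  VALUES ('Mozilla/4.0 (compatible; MSIE 6.0)');

SELECT id FROM agent
 WHERE name = 'Mozilla/4.0 (compatible; MSIE 6.0)';

The UNIQUE index on name turns repeated inserts of the same string into silent no-ops, so the SELECT always returns exactly one id, ready to be stored in the hit table.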

The hit table contains all the leftover data that we've decided to keep. The time_stamp and status columns are unchanged, except for being marked NOT NULL; after all, every request gets a time stamp and a status. The host column has been shrunk to a fixed fifteen bytes because we're storing IP addresses rather than hostnames anyway, and the longest possible IP address in dotted-decimal notation is four three-digit groups plus three dots, or 15 characters. Finally, there are three id columns that link each hit to the appropriate agent, referer, and URI.
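Reporting queries against the archive simply join back through those id columns. For example, a top-ten-URIs report for a given time range might look something like this (the timestamps are placeholders for whatever range you care about):

SELECT uri.name, COUNT(*) AS hits
  FROM hit, uri
 WHERE hit.uri_id = uri.id
   AND hit.time_stamp BETWEEN 1041379200 AND 1041465600
 GROUP BY uri.name
 ORDER BY hits DESC
 LIMIT 10;

If you run this sort of query often, an index on hit.time_stamp (and perhaps on uri_id) is worth adding.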

There's also a fair amount of data we decided not to preserve when moving from the mod_log_sql format to this new compact format. SSL-related items, for example, aren't interesting to keep. The same goes for many of the other request_* columns.

Aside from saving space, another side benefit of this design is that the hit table is composed of all fixed-length fields and therefore fixed-length rows. MySQL’s MyISAM tables have some handy optimizations when dealing with fixed-length rows. Lookups are faster, concurrency is improved, and you can INSERT new rows while also running SELECTs against existing rows if there are no “holes” in the table. We’ll never have holes because we do not plan to ever run a DELETE query against the table.
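You can confirm that MySQL really is using the fixed row format with SHOW TABLE STATUS; the Row_format column should read Fixed, and Avg_row_length should line up with the per-row size discussed next:

SHOW TABLE STATUS LIKE 'hit';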

The result of this new design is that each new hit requires only 38 bytes of space in the table. That means we can store roughly 27,000 hits in 1 MB of space. Quite compact.
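Here's roughly where those 38 bytes come from, assuming MySQL's standard column sizes plus one byte of MyISAM per-row overhead (the delete flag):

  time_stamp   INTEGER UNSIGNED    4 bytes
  host         CHAR(15)           15 bytes
  status       SMALLINT            2 bytes
  bytes_sent   INTEGER UNSIGNED    4 bytes
  agent_id     INTEGER UNSIGNED    4 bytes
  referer_id   INTEGER UNSIGNED    4 bytes
  uri_id       INTEGER UNSIGNED    4 bytes
  MyISAM fixed-row header          1 byte
                                  --------
                                  38 bytes per row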

Next Steps

With the new design in place, the next task was to think about migrating the existing ten million records to the new database and creating a way to do this on a regular (nightly?) basis as new data rolls in.

We'll pick up there next month, looking at a single script that you can use to perform both tasks. Meanwhile, think about other ways to squeeze the data into even fewer bytes. (Once the conversion is done, we'll look at an effective strategy for doing exactly that. In fact, we'll get each hit down to a mere 27 bytes.)



Jeremy Zawodny plays with MySQL by day and spends his spare time flying gliders in California and writing a MySQL book for O’Reilly & Associates. Reach him at Jeremy@Zawodny.com.
