Cleaning Out a Logging Database

The Apache Web server that handles the www.stonehenge.com domain logs its transactions directly to a MySQL database using a mod_perl handler. This is really cool, because I can perform statistical correlations on hits over the past few months, including such complex things as the average CPU time used for a particular URL (to see if some of my dynamic pages need better caching) and the greatest number of referrers to a particular page.

However, it’s a bit uncool because even for a moderate site like mine, the storage requirement to hold information about each of the hits over time can grow rapidly (about 100 MB per month at the current hit rate).

Every once in a while, my ISP admin would yell at me and tell me to reduce the size of my MySQL usage. So I’d type a few commands to roll out the oldest data into a second temporary table, then mysqldump that table into a flat file, compress the file, and then delete the information from the original table. If I was feeling particularly nice, I’d also perform an OPTIMIZE TABLE as well. The compressed file takes up far less space than the original file (about one-fifteenth or so), so I could safely store the historical data for a lot longer should I ever want to restore it all for a grand archaeological trip.
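For the curious, the by-hand ritual looked roughly like the following. The table name, cutoff date, and output file here are illustrative stand-ins, not a transcript of an actual session:

  mysql> CREATE TABLE requests_old SELECT * FROM requests WHERE when < '2002-04-01';
  $ mysqldump --opt --user=XXXX --password=XXXX merlyn_httpd requests_old | gzip -9 > requests_old.sql.gz
  mysql> DELETE FROM requests WHERE when < '2002-04-01';
  mysql> OPTIMIZE TABLE requests;
  mysql> DROP TABLE requests_old;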

But there are many steps to this procedure, and because it was performed infrequently, and I wasn’t smart enough to take notes about what I was doing, I always found myself staring at the MySQL online documentation for 10 to 15 minutes. Once I was sure of the commands I was typing, I would proceed, cursing my ISP for not providing me with unlimited disk space.

So it got to be “that time of the month” again a few days ago, and I decided once and for all to automate the procedure. Of course, this turned out to be a bit silly, because the work it took to automate this monthly rollout took about the same amount of time it would have taken to do this procedure for a year. However, just in case a few of you can benefit from this technique (or perhaps a similar one), I’ll pass along the program to recoup my investment.

I’ve tried to make the program rather generic. Although it’s specific to MySQL, it should work with any table that has a date-time stamp that increases with time.

So, without further delay, I introduce my so-called time-saver in Listing One.

Line 1 turns on warnings. Line 2 turns on compiler restrictions (no symbolic references, variables must be declared, and barewords are not permitted). Line 3 disables output buffering; what little output we generate, we want to see right away.

Lines 5 to 10 define the configuration parameters, making this program somewhat reusable for other applications. $DIR is the directory in which the output files will be stored. $DB is the MySQL database name, and $TABLE defines the table within that database that will be rolled out. $STAMP is a date-time field within the $TABLE table that will be used to select the records of interest for archival. Finally, $DB_USER and $DB_PASS define the MySQL user that has table creation and access privileges for that database. Of course, for some of these values I've replaced my live data with stand-ins. I'll let you guess which ones.

Line 12 pulls in the DBI module (installed from the CPAN). You'll also need DBD::mysql installed; DBI loads it automatically when we connect to a dbi:mysql data source.
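If you don't have them already, both install the usual way from the CPAN shell:

  perl -MCPAN -e 'install DBI'
  perl -MCPAN -e 'install DBD::mysql'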

Line 14 puts us into the right directory, permitting my program to be run from cron without a special directory change in the invocation.
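For example, a crontab entry along these lines (the script path is made up for illustration) would run the rollout at 4:30 a.m. on the first of every month:

  30 4 1 * * /home/merlyn/bin/rollout-log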

Lines 16 and 17 connect to the database, using the configured database name, username, and password. In addition, I've enabled the RaiseError parameter, causing any significant error to automatically throw an exception (die), and disabled the automatic printing of error messages. This keeps me from having to add an "or die…" after each database call.
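Without RaiseError, every database call would need its own check, something along the lines of:

  $dbh->do("OPTIMIZE TABLE $TABLE")
    or die "optimize failed: " . $dbh->errstr;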

Line 18 is a bit of voodoo programming. One time when I was dealing with large tables in SQL, I got some sort of table overflow, and it aborted my program. I found this command in the manual and now ritually include it in each program, a gift to the deities to keep from bombing out. I have no idea when it’s needed, and the manual isn’t clear about that, and I still don’t know why it isn’t the default. But there it is.

Lines 21 to 33 determine the temporary table name and the upper bound for the date range, derived dynamically from the oldest entry in the current records.

First, lines 22 and 23 get the timestamp of the oldest entry as a Unix epoch value (the number of seconds since the Unix base time). Line 25 dumps that out for logging purposes by using localtime in a scalar context.

Next, lines 27 to 30 compute a “current month” and “next month” pair of values based on that stamp, again using localtime, but now in a list context. The year number from localtime is offset by 1900, and the month by 1, so we will have to add those back in line 28. The ending month is initially the start month plus 1, but if that exceeds the end of the year, we will simply adjust the year and month appropriately.

Lines 31 and 32 format the current month into the archival table name, and the next month into a MySQL date string for the first of that month. These two values end up in the outer $table_name and $before variables, respectively, which line 35 reports.
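To make that concrete with a purely illustrative date: if the oldest record were stamped in mid-March 2002, the pieces would fall out like this:

  # illustrative only: oldest record stamped 2002-03-17
  # localtime gives year 102 and month index 2
  # start: year 2002, month 3  ->  $table_name = "requests_2002_03"
  # end:   year 2002, month 4  ->  $before     = "2002-04-01"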

Lines 38 to 48 get the table definition (or at least a portion of it) so that we can create an archival table similar to the original. First, line 39 creates a statement handle asking MySQL for the table structure, which is executed in line 40.

Line 41 is a bit of magic, so let's go over it slowly. First, from right to left, I'm declaring a %col variable. A reference to that variable forms the sole element of a list that the backwards for-each statement is now processing. So $_ is equal to that hash reference as we execute the bind_columns call.

From the inside out, we start with the NAME_lc element of the $info hashref, which holds the names of the columns, lowercased for consistency. This returns an array ref that we dereference to form the keys of a slice created by dereferencing $_, which means a hash slice of %col.

Whew! Almost there. The “take a reference to” operator is applied to this hash slice, which gets a list of references to the values of that slice, and those are fed as a list to bind_columns.

So, the net effect is that $col{field} is automatically bound to the Field column on each fetch. Wow! Okay, there are 17 other ways to do this, but it's pretty slick (or sick?) when you see how it works.
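If the condensed form makes your eyes water, here's a more verbose equivalent (my expansion, not what the listing actually uses):

  my %col;
  my @refs = map { \$col{$_} } @{ $info->{NAME_lc} };  # one reference per column, in column order
  $info->bind_columns(@refs);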

Lines 43 to 46 grab the column definitions by constructing an array of the fields and types for each of the columns in the target table. This is not a complete representation of the table, but it gives us a safe place to copy the data (and restore it later) without losing any precision or meaning.

Line 47 returns the column definition, which ends up in $cols from line 38. I use a do block here because it lets me create some temporary variables while I'm deciding what the initial (or only) value of a variable should be, and then discard those variables once I know the answer.
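Just to show the shape of the result, a hypothetical requests table (these column names and types are invented for illustration, not my real layout) might leave $cols holding something like:

  # hypothetical example only
  # $cols eq "when datetime,host varchar(255),url varchar(255),cpuuser float"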

Beginning in line 50, we’ve got the code to start moving real data around. First, we make the archival table using the definition computed earlier. If anything happens to break in lines 51 or 52, we’ll be out of the program with a fatal untrapped die.

But once we’ve got the temporary archival table, we want to be nice to the database and clean it up if something breaks; so starting in line 53, we’ll stay within an eval block to trap those errors.

Lines 54 to 57 copy the data from the live table to the archival table and show how many rows were affected.

Lines 59 to 62 verify that the archival table now holds that many rows, just to be doubly safe.

Lines 64 to 66 show the date range of the data to be archived. This is not really needed for the calculations; it’s just there to keep me feeling comfy about how much data I’m storing away.

Lines 69 to 81 call the mysqldump utility with the necessary arguments to extract the archival table into a compressed flat file. Since we’ll want to control the output of the child process precisely to feed it through the compressor, we’ll have to fork ourselves manually. Line 75 computes the output file and aborts in line 76 if the file already exists. All through this program we’re playing it safe, perhaps excessively so. The arguments in line 79 define the username, password, database, and table name for the dump, which will end up being passed through an aggressive gzip (set up in line 77) to generate the file.
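The net effect of the child process is roughly what this shell pipeline would do, shown here with made-up credentials and a sample table name:

  mysqldump --opt --user=XXXX --password=XXXX merlyn_httpd requests_2002_03 |
    gzip -9 > mysql_requests_2002_03.gz

Doing the fork and exec by hand just lets the program hand over the arguments without worrying about shell quoting, and check the child's exit status (line 73) before any data gets deleted.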

If all went well, we continue through lines 84 to 86, deleting the data from the original live table, and finally drop the temp table in line 89. If that's all good, then a final OPTIMIZE TABLE in line 92 reclaims the space left behind by the deleted rows.

If there was any abort in the middle, we'll pop down to line 93 and show the error as a warning. Line 95 tries to drop the temporary table one more time, in case line 89 was skipped, and then we're out of there in line 97.

We’ve now got a program that automates all those steps I was doing by hand. When my ISP admin yells at me about space, I can automatically peel off the oldest month’s worth of material with this script, gaining that much more time to respond to my e-mail. And there you have it. Until next time, enjoy!

Listing One: Randal's Timesaver

1      #!/usr/bin/perl -w
2      use strict;
3      $|++;

5      my $DIR = "/merlyn/web/OldLog";
6      my $DB = "merlyn_httpd";
7      my $TABLE = "requests";
8      my $STAMP = "when";
9      my $DB_USER = "--guess--";
10     my $DB_PASS = "--not--";

12     use DBI;

14     chdir $DIR or die "Cannot chdir to $DIR: $!";

16     my $dbh = DBI->connect("dbi:mysql:$DB", $DB_USER, $DB_PASS,
17                            { PrintError => 0, RaiseError => 1 });
18     $dbh->do("SET OPTION SQL_BIG_TABLES = 1");

20     ## first, get start and end for dates
21     my ($table_name, $before) = do {
22       my $lowest_epoch = $dbh->selectrow_array
23         ("SELECT UNIX_TIMESTAMP(MIN($STAMP)) FROM $TABLE");

25       print "$lowest_epoch = ".localtime($lowest_epoch)."\n";

27       my @low = localtime($lowest_epoch);
28       my ($start_year, $start_month) = ($low[5] + 1900, $low[4] + 1);
29       my ($end_year, $end_month) = ($start_year, $start_month + 1);
30       $end_year++, $end_month = 1 if $end_month > 12;
31       sprintf("${TABLE}_%04d_%02d", $start_year, $start_month),
32         sprintf("%04d-%02d-01", $end_year, $end_month);
33     };

35     print "table is $table_name, before is $before\n";

37     ## now get table definition
38     my $cols = do {
39       my $info = $dbh->prepare("DESCRIBE $TABLE");
40       $info->execute;
41       $info->bind_columns(\@$_{@{$info->{NAME_lc}}}) for \my %col;

43       my @coldefs;
44       while ($info->fetch) {
45         push @coldefs, "$col{field} $col{type}";
46       }
47       join(",", @coldefs);
48     };

50     ## transfer live data to temp table
51     $dbh->do("DROP TABLE IF EXISTS $table_name");
52     $dbh->do("CREATE TABLE $table_name ($cols)");
53     eval {
54       my $count_insert =
55         $dbh->do("INSERT INTO $table_name SELECT * FROM $TABLE WHERE $STAMP < ?",
56                  undef, $before);
57       print "inserted $count_insert rows\n";

59       my $count_selected =
60         $dbh->selectrow_array("SELECT count(*) FROM $table_name");
61       print "transferred $count_selected rows\n";
62       die "mismatch!" unless $count_selected == $count_insert;

64       my ($min_date, $max_date) =
65         $dbh->selectrow_array("SELECT MIN($STAMP), MAX($STAMP) FROM $table_name");
66       print "dates range from $min_date to $max_date\n";

68       ## use mysqldump to create file
69       print "dumping...\n";
70       defined(my $kid = fork) or die "Cannot fork: $!";
71       if ($kid) { # parent
72         waitpid($kid, 0);
73         die "bad exit status: $?" if $?;
74       } else { # kid
75         my $file = "mysql_$table_name.gz";
76         die "$file exists, aborting\n" if -e $file;
77         open STDOUT, "|gzip -9 >$file" or die "gzip: $!";
78         exec "mysqldump", "--opt",
79           "--user=$DB_USER", "--password=$DB_PASS", $DB, $table_name;
80         die "Cannot exec: $!";
81       }

83       ## delete original data, and drop table
84       print "deleting...\n";
85       my $count_delete =
86         $dbh->do("DELETE FROM $TABLE WHERE $STAMP < ?", undef, $before);
87       print "deleted $count_delete rows\n";

89       $dbh->do("DROP TABLE IF EXISTS $table_name");

91       print "optimizing...\n";
92       $dbh->do("OPTIMIZE TABLE $TABLE");
93     }; warn "insert block: $@" if $@;

95     $dbh->do("DROP TABLE IF EXISTS $table_name");  # in case it didn't happen

97     $dbh->disconnect;

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.
