Watching the Web

Everyone who runs a web site wants to know “How am I doing?” The total number of hits, the number of unique visitors, and which pages are the most popular are just a few of the metrics that gauge a site’s traffic. All of that data exists in the web server’s log, if only you can tease it out. While several commercial applications provide such analysis, The Webalizer is a free and fast log analyzer that may just be superior, too. Here’s a hands-on guide.

Nearly everyone who runs a web site wants to know “How am I doing?” How many hits am I getting? Which of my pages are the most popular? How are visitors finding my site? Is anybody else linking to me? Of course, all of the answers are captured in the web server’s logs, so the question ultimately becomes “What analysis software should I use?”

There are lots of choices. There are several expensive commercial solutions, such as Webtrends (which recently dropped Linux support), and an even larger number of open source options. Most of the free solutions are written in Perl — a natural choice for a text processing-intensive job like log analysis, but also a CPU hog.

So let’s take a look at an alternative written in C. The Webalizer, written by Bradford Barrett and available at http://www.webalizer.com, was designed for speed: it can process log data much faster than anything written in Perl, a good thing for busy web sites with huge log files and already-taxed CPUs. Webalizer typically runs out of cron at regular intervals (hourly or daily), and it creates customized HTML reports that can be viewed from any browser.

Building and Installing The Webalizer

The Webalizer web site offers many binary distributions and the installation instructions are good.

Webalizer is composed of just four files: the main webalizer executable, a symbolic link to it from webazolver, a manual page, and a sample configuration file. Installation is usually done by root, but doesn’t strictly need to be.

Webalizer requires the usual software development packages, plus two other open source libraries: the gd graphics library (http://www.boutell.com/gd) and the Berkeley DB library (http://www.sleepycat.com). gd creates the image files for the Webalizer graphs, and it in turn requires a few other commonly used libraries: zlib, jpeg, and libpng. Berkeley DB stores data for reverse DNS lookups.

If your distribution does not include those libraries by default, the software is readily available in any number of package formats. Or, if you prefer to build from source, enable version 1.85 compatibility in Berkeley DB with the --enable-compat185 parameter. (Without this, The Webalizer almost compiles.) When building Webalizer itself, pass the --enable-dns parameter to the configure program if reverse DNS resolution is desired. Some platforms also require setting export LIBS=-lpthread in the environment, and FreeBSD additionally needs export CFLAGS='-DHAVE_DB_185_H'.

Initial Configuration

Webalizer’s run-time behavior is determined by configuration files and/or the command-line, and only a few of its settings are truly crucial to proper operation. The rest of its options are “tuning” parameters that can be adjusted to taste. However, some configuration settings can cause surprises, so understanding the exact order of processing can forestall them.

When the webalizer program runs, and before it examines command-line parameters, it attempts to load a default configuration file. First, it looks for webalizer.conf in the current directory, and if that’s not found, it attempts to load /etc/webalizer.conf. Then, when processing the command line, it loads any configuration file specified with -c. That means that providing -c webalizer.conf on the command line loads that file twice: once by default, then once by specific request. This can create surprises with options that are cumulative (especially those for HTML formatting of the output page).
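The double-load effect is easy to picture with a small Python sketch. This is only an illustration of the behavior described above, not Webalizer’s code: the load_config helper and the in-memory “file” are hypothetical, and the sketch assumes that a cumulative directive simply appends to a list each time a file is read.

```python
# Illustrative model only; not Webalizer's actual implementation.
# Assumption: cumulative directives (such as the HTML* family) append to a
# list every time a configuration file is read.

def load_config(lines, settings):
    """Append each 'Keyword value' line to the list kept for that keyword."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue                      # keyword without a value
        keyword, value = parts
        settings.setdefault(keyword, []).append(value)
    return settings

# Hypothetical contents of ./webalizer.conf
conf = [
    "# sample",
    'HTMLHead <link rel="stylesheet" type="text/css" href="../site.css">',
]

settings = {}
load_config(conf, settings)   # default load: webalizer.conf in the current directory
load_config(conf, settings)   # second load: the same file named again with -c

print(len(settings["HTMLHead"]))  # the stylesheet link is now queued twice
```

Under those assumptions, the same header line would be emitted twice in every generated page, which is the kind of surprise that leaving out -c avoids.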

There are two settings that are always required, and Webalizer won’t run without them. You must specify the name of the log file to process and the name of the output directory. Once you understand these parameters, you can do a “test run” and then start tweaking the output from there.

*The name of the log file to process can be provided in the configuration file (using the LogFile keyword) or on the command line. Filenames ending in .gz are decompressed automatically as they are read, which allows historical log files to be compressed to save space. Only one log file is processed at a time, and additional names provided on the command line are silently ignored. If no log file is specified anywhere, standard input is read.

*Once Webalizer has processed all of the input (both log files and history files), it creates output in the form of HTML pages, graphic images, and perhaps history or status files. These go to the directory specified by the OutputDir configuration file entry or by the -o directory command-line option. In most cases, the output directory is simply set to stats, which is relative to the current directory.

Here’s one convenient trick. On a typical web server, the log files are kept outside of the web root (perhaps rightly so, as these are probably not public information), but it’s nevertheless convenient to make the logs and the stats “neighbor” directories with symbolic links:

# mkdir /home/apache/webroot/stats
# cd /home/apache/logs
# ln -s ../webroot/stats stats

Now the …/webroot/stats and …/logs/stats directories refer to the same place, so running the stats from the logs directory just does the right thing.

# cd /home/apache/logs
# webalizer access_log

This is likely to produce a lot of warnings about ill-formed log entries. However, these can be ignored. When Webalizer finishes, point a web browser to http://www.your-domain/stats to see the very first version of your web statistics. Next, you can move on to fine-tune the output.

Dissecting Droves of Data

One of the first questions that you’ll want to ask is “Which of my pages is most popular?” The information is certainly there in the log files, but the rankings are likely cluttered with housekeeping files, including site logos, decorative images, stylesheets, icons, and the like. To yield the most popular meaningful pages, those “extraneous” files must be omitted.

To exclude a file entirely from the rankings, use the HideURL directive in the configuration file. The default already excludes most images, but you can add more to the list:

HideURL *.css
HideURL *.ico
HideURL /robots.txt

Here, the HideURL commands exclude all stylesheets and icons, and the robots.txt file that search engine spiders fetch to guide their travels on a site. Some sites may not want to exclude all images — a photography site, for instance — so a bit more fine-tuning may be required to include your real content but exclude the rest.

Next, you may want to know who’s been visiting your site. Usually, you’ll be your own best customer: when building a web site, it’s common to refresh a given page many times while fleshing out a page design, and this dramatically skews the statistics. So, Webalizer allows you to exclude your own visits using IgnoreSite in the configuration file. Here’s an example: IgnoreSite my.ip.address.

As you can see, Hide and Ignore have different purposes. Likewise, the two directives work differently.

The Hide directive suppresses display of data in one section, but the log data producing it is still used in other sections. For instance, you may choose not to show .css files in the “Top URLs” section, but those files still contribute to the total kilobytes used by the sites doing the visiting.

On the other hand, lines matching an Ignore directive are discarded entirely as they are read from the logfile — it’s as if they never existed, and they do not contribute in any way to any statistic.
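The difference can be modeled in a few lines of Python. This is a sketch of the semantics only, not Webalizer’s implementation: the sample records are made up, and the use of shell-style wildcard matching through fnmatch is an assumption made for the illustration.

```python
# Illustrative model only; not Webalizer's actual implementation.
# Assumption: patterns are matched with shell-style wildcards via fnmatch.
from collections import Counter
from fnmatch import fnmatch

# Hypothetical log records: (client address, requested URL, bytes sent)
records = [
    ("203.0.113.5", "/index.html", 4000),
    ("203.0.113.5", "/site.css", 1000),
    ("198.51.100.7", "/index.html", 4000),
    ("192.0.2.1", "/index.html", 4000),   # the site owner's own visit
]

hide_urls = ["*.css"]          # plays the role of: HideURL *.css
ignore_sites = ["192.0.2.1"]   # plays the role of: IgnoreSite 192.0.2.1

total_bytes = 0
url_hits = Counter()

for site, url, size in records:
    # Ignore: the record is dropped before anything is counted.
    if any(fnmatch(site, p) for p in ignore_sites):
        continue
    total_bytes += size
    url_hits[url] += 1

# Hide: the record was counted above; it is only left out of one table.
top_urls = [(u, n) for u, n in url_hits.most_common()
            if not any(fnmatch(u, p) for p in hide_urls)]

print(top_urls)      # /site.css is absent from the ranking
print(total_bytes)   # but its bytes are still part of the total
```

In this model the hidden stylesheet still adds to the byte total, while the ignored visit leaves no trace in either figure.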

Some object to the use of Ignore directives, as their use produces statistics that are not in line with what actually happened on the web server. Others counter with the notion that some kinds of data (say, referral spam) contribute nothing to understanding how a web site is doing. But all agree that Ignore should be used sparingly, keeping in mind the full effect the directive has on site statistics.

Tuning Referrers

When a link appears on a web site and a visitor clicks on that link, the web browser typically lets the target web site know where the link came from. This is a referrer, and it allows a web site operator to know who’s linking to its content. Many web site operators watch this keenly. But several issues emerge when considering referrals.

First, the great majority of referrals come from your own site (for instance, your “Contact Us” page is linked from your own home page), and seeing yourself at the top of the referrer list is not usually helpful. Again, you can exclude yourself from the referral listings, as this does for the author’s web site at unixwiz.net:

HideReferrer unixwiz.net

It’s also common to see “the same” referral show up in more than one place due to differences in domain name or URL construction. It’s preferable to show them grouped together. For example, unixwiz.net regularly gets referrals from both postfix.org and postfix.com, yet both names go to the same place. To show the two sites listed as a single entry, use the GroupReferrer directive:

GroupReferrer postfix.org Postfix.org website
GroupReferrer postfix.com Postfix.org website

Now, all referrals from either postfix.org or postfix.com are grouped together on a single line with the label Postfix.org website, and this summarizes all of the traffic received from this one source.

But this is shown in addition to, not instead of, the original entries. You can add two more lines to hide these constituent parts:

GroupReferrer postfix.org Postfix.org website
GroupReferrer postfix.com Postfix.org website
HideReferrer postfix.org
HideReferrer postfix.com

In fact, this can be extended more broadly. For example, unixwiz.net, like many others, receives most of its referrals from Google, but Google has so many country-specific sites that the referrals list is cluttered with “google.something” entries. Rather than see all this fine detail, you can use GroupReferrer and HideReferrer to summarize all Google activity into three groups:

GroupReferrer gmail.google.com GMail
HideReferrer gmail.google.com

GroupReferrer google.com Google US
HideReferrer google.com

GroupReferrer google. Google Intl
HideReferrer google.

The first set isolates Google’s Gmail, which shows web traffic referred by email. Next is the main Google site, then all the others (google.ca, google.co.uk, google.de, and so on) grouped as Google Intl. This dramatically reduces the clutter in the referral listings. Those who care to see per-country referrals wouldn’t do this, of course.
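The order of those three groups matters, as the following Python sketch suggests. It is a model of the idea rather than of Webalizer itself, and it rests on two assumptions: a pattern matches if it appears anywhere in the referring host name, and the first matching rule wins. The host names in the sample list are made up.

```python
# Illustrative model only; not Webalizer's actual implementation.
# Assumptions: a pattern matches if it occurs anywhere in the referrer's
# host name, and the first matching rule in the list wins.
from collections import Counter

group_rules = [
    ("gmail.google.com", "GMail"),
    ("google.com", "Google US"),
    ("google.", "Google Intl"),
]

def group_for(host):
    """Return the label of the first rule matching host, or host itself."""
    for pattern, label in group_rules:
        if pattern in host:
            return label
    return host

# Hypothetical referrer host names
referrers = [
    "www.google.com", "www.google.de", "gmail.google.com",
    "www.google.co.uk", "www.google.com", "www.postfix.org",
]

summary = Counter(group_for(h) for h in referrers)
for label, count in summary.most_common():
    print(label, count)
```

If the catch-all google. rule were listed first in this model, it would swallow the other two groups, which is why the most specific pattern comes first.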

One problem that can skew statistics is referral spam (discussed more fully in the sidebar “Stats and Spam”). These annoying referrals often show up in the stats listings, sometimes with relatively high ranks if your server was targeted. Thankfully, you can cause these entries to be disregarded entirely.

IgnoreReferrer get-your-money-fast.com
IgnoreReferrer make-it-bigger.com

Recall that Ignore directives cause the entire log entry (including the URL visited, the visitor’s IP address, the referrer, and the rest) to be discarded completely. These bogus entries are not real “traffic” and simply should not be counted.

Other Knobs to Turn

The Webalizer contains many other settings that guide how statistics are presented, though it’s best to change just a few at a time to make it easier to “backtrack” if something doesn’t turn out as expected. Though some of these options can be changed via command-line parameters, you may prefer to organize them in the webalizer.conf file so they’re all grouped together. Many of these options are more or less self-evident from looking at the sample configuration file, but others warrant special notice.

The web pages produced by Webalizer are a bit on the sparse side, and you may like to customize them to “look” like the rest of the site. One can’t really change the structure of the document (text, overall layout, graphic formats), but by using the HTMLPre, HTMLBody, HTMLPost, and other related keywords, you can change the background, text at the header and footer, include meta tags, and so on.

By using the HTMLHead configuration directive, one can import the site’s overall stylesheet into the stats page, which alone can do much to unify the look and feel:

HTMLHead <link rel="stylesheet" type="text/css" href="../site.css">

One word of caution: all of the HTML* directives are cumulative, so multiple lines are allowed (with all of them being inserted into the appropriate place in the output). If you include these HTML* directives in multiple configuration files, the lists do not get reset for each file and surprising (and broken) output can result.

Not everybody cares for all the sections produced by Webalizer, and it’s possible to disable most of them with simple directives. For example, if you don’t need to know traffic on a per-hour basis, you can use…

HourlyStats no
HourlyGraph no

… to disable this.

IP-to-Name DNS Resolution

By default, display of visitors (“Sites”) is done by IP address only, but it’s common to request this by domain name (comcast.net, microsoft.com, fbi.gov, and so on). Turning an IP address back into a name is a straightforward (though not always successful) process, but this reverse resolution often involves network delays and timeouts that can be quite lengthy.

To speed this process up, Webalizer can spawn several helper processes to perform lookups in parallel, and the number of these children is given with the -N number parameter. -N 5 is a reasonable number to start with, and -N 0 disables the lookups entirely.

Once an IP address has been mapped to a name, it’s stored in a cache file (a small database) so that the next lookup of that IP address is answered immediately. This file is usually named dns_cache.db, and its name is specified with the DNSCache directive.
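The value of the cache can be sketched in Python. This is an illustration of the caching idea only, not Webalizer’s code: the cache here is a plain in-memory dictionary rather than a database file, the function names are hypothetical, and the demonstration uses a stand-in resolver with made-up host names so that it runs without network access.

```python
# Illustrative model only; not Webalizer's actual implementation.
# Assumption: the cache is a plain dict; Webalizer keeps its cache in a
# database file on disk instead.
import socket

def reverse_lookup(ip):
    """Resolve ip to a host name, falling back to the address itself."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip

def cached_lookup(ip, cache, resolver=reverse_lookup):
    """Consult the cache first; call the (slow) resolver only on a miss."""
    if ip not in cache:
        cache[ip] = resolver(ip)
    return cache[ip]

# Demonstration with a stand-in resolver so no network access is needed.
calls = []

def fake_resolver(ip):
    calls.append(ip)
    return "host-" + ip.replace(".", "-") + ".example.net"

cache = {}
for ip in ["192.0.2.10", "192.0.2.11", "192.0.2.10"]:
    print(ip, "->", cached_lookup(ip, cache, fake_resolver))

print(len(calls))  # the resolver ran once per distinct address
```

A busy log repeats the same addresses many times, so even a simple cache like this one removes most of the slow lookups.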

Though DNS lookup substantially slows down processing, the information it provides is seen as useful by many, albeit with one caveat: names looked up from an IP address are not authoritative, and there can be several sources of misleading information. First, a name is associated not only with an IP address but also with a time, and the mapping in place at the time of the web site visit may be different from the mapping found at a later date. This is particularly common on internal networks where IP addresses are doled out via DHCP: an access from a given address could have come from STEVEPC at the time, but many hours later that address could have been reassigned to MARTINPC.

Second, it’s possible that the administrator of any given inverse zone could outright lie about name mapping, and though this isn’t common, it’s not impossible either.

Under the Hood

The Webalizer generally focuses on monthly statistics, with only limited information reported over longer periods. It uses several internal files that help keep it all in order, and the whole notion of “status” and “history” and “incremental mode” has proven confusing to many. Here’s an explanation of each.

First, “history.” Though The Webalizer produces detailed pages only for one month at a time, the top-level page includes a one-line summary for each month that’s been processed in the past. It’s clear that it’s not reprocessing all of the old log files — that would be astonishingly inefficient — so Webalizer is instead relying on a history file to persist this summary data.

This file, usually called webalizer.hist, contains that one-line monthly information. It’s re-read and updated each time statistics are regenerated, and it is used only for the monthly summary page. This file is quite small, and though it’s possible to turn off history with the -i command-line option or the IgnoreHist directive, there’s usually no good reason to do so.

Next, “status.” It’s easy to collect statistics over a whole month when that entire month’s log data is presented at one time, but it’s a bit more tricky when it’s presented a day at a time. To show stats over the entire month, Webalizer maintains a status file containing counts for all of the visitors, pages, referrers, and all of the other data that it reports.

This file, usually named webalizer.current, can grow quite large over the course of a month (though it’s cleared at the start of a new one). Like the history file, it’s read at the start of each run, updated on-the-fly, and rewritten at the end of processing. This is enabled with the -p command-line option or the Incremental directive.

For regular, production use, both history and incremental modes ought to be used, but it may save some processing time to disable them while testing a configuration file. After all, if everything is rebuilt each time, there’s no state from a previous run to consider, so there’s no point in creating it.

Where surprises can occur is while re-running prior statistics, perhaps to incorporate updates to the configuration file. The Webalizer keeps track of the date and time of the last log entry it processed and ignores those before that. This feature allows you to re-run statistics on an hourly basis with the same log file, since the same entries won’t be counted twice. But if older data is presented to Webalizer, it’s going to be ignored.
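A short Python sketch shows why older data disappears. It models only the timestamp check described above and is not Webalizer’s code: the process function is hypothetical, the saved state is reduced to a single timestamp, the dates are made up, and treating entries at exactly the saved timestamp as already counted is an assumption.

```python
# Illustrative model only; not Webalizer's actual implementation.
# Assumption: the saved state is just the timestamp of the last entry counted,
# and entries at or before that timestamp are skipped.
from datetime import datetime

def process(entries, last_seen):
    """Count entries newer than last_seen; return (count, new last_seen)."""
    counted = 0
    for stamp in entries:
        if last_seen is not None and stamp <= last_seen:
            continue          # counted on an earlier run, or older data
        counted += 1
        last_seen = stamp
    return counted, last_seen

log = [datetime(2005, 3, 1, h) for h in (1, 2, 3)]

first, state = process(log, None)      # initial run counts all three entries
log.append(datetime(2005, 3, 1, 4))
second, state = process(log, state)    # rerun on the same log counts only the new entry
older = [datetime(2005, 2, 28, 23)]
third, state = process(older, state)   # older data is skipped entirely

print(first, second, third)
```

The same check that prevents double counting on an hourly rerun is what silently drops a log file from an earlier period.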

It’s possible to provide “ignore history and status” parameters in this case, but it’s usually easier to just remove the two files in question, webalizer.hist and webalizer.current, before processing as usual.

Putting It All Together

Many who prepare web statistics on a regular basis do so in conjunction with log rotation. To ensure that web server logs do not grow without bound, they are archived and truncated at regular intervals, and this is a natural time for Webalizer to enter the picture.

Running web statistics on a daily basis is usually a fair compromise of “efficiency” versus “timeliness”, and you should consider running Webalizer at local midnight to summarize data for the previous day. This process can be run out of cron or hooked into the logrotate facility, but a few aspects are common to all approaches.

*Most Webalizer options should appear in the configuration file rather than being passed on the command line. This ensures that all processing, whether done manually or automatically, runs with the same parameters each time.

*Be sure there is just one webalizer.conf file in use, which is best done by simply relying on that file in the runtime directory and omitting any -c parameter on the command line.

*It’s easy to get into “permissions” trouble when mixing “manual log processing” with “automated log processing.” If the manual user, the logrotate user, and the Webalizer user don’t have compatible permissions, you’re likely to find that Apache or Webalizer isn’t able to update its files. Check email from the system regularly after changes or manual log processing.
