While most large organizations already have a search feature on their web site, many small- and medium-sized organizations do not. For whatever reason, there's long been a perception that getting good search results on your web site is complicated or expensive. This month's column begins a two-part series about adding search features to your web site.
Given the dominance and ubiquity of Google, you might wonder if search is worth spending any time on at all. Usability guru Jakob Nielsen sure thinks so. In his May 2001 Alertbox article titled “Search: Visible and Simple” (available online at http://www.useit.com/alertbox/20010513.html), Nielsen says: “Search is the user’s lifeline for mastering complex websites. The best designs offer a simple search box on the home page and play down advanced search and scoping.”
Nielsen goes on to develop a compelling case for having search on your site: not just a search link, but an actual search box that visitors can use to find what they’re looking for. It’s no accident that there’s a search box at the top left corner of the Linux Magazine web site (http://www.linux-mag.com).
One of the first decisions to make is whether to build search yourself or pay someone else to do it for you.
By doing it yourself, you’ll have complete control over the look and feel of the search results. For example, you can filter documents that you’d rather not see in the results and tweak the look and feel. But you’ll also have to endure the burden of locating, installing, configuring, and maintaining the software.
On the other hand, by paying someone else to handle your search, you’ll spend less time messing with the software, but you’ll likely have to accept a less flexible presentation.
If you try the Linux Magazine search, you might notice that it takes you to a different server (linux-mag.master.com). However, the user interface remains quite similar to the home (linux-mag.com) site. Master.com uses Thunderstone search technology (http://www.thunderstone.com) behind the scenes to provide free and customizable search services.
An Overview of ht://Dig
This month we’ll look at setting up ht://Dig (http://www.htdig.org), a GPL’d document search and indexing system. (For the remainder of this article, we’ll refer to ht://Dig simply as HTDig.) Next month, we’ll examine a hybrid model using Google’s search API.
HTDig is a small collection of very powerful tools. Together they provide web crawling, document indexing, index maintenance, and ultimately, search.
Fundamentally, HTDig works the way many of the big Internet search engines do. The crawler (named htdig) starts with a set of URLs, fetches their content, extracts the links it finds, and follows them. Using configuration settings, you can keep the crawler from traveling outside your domain (otherwise it might never finish). Along the way, it gathers statistics about the content of each page.
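To make the crawl loop a little more concrete, here is a minimal Python sketch of a single step: pull the links out of one page and keep only those that stay on the same host. It is an illustration of the general idea, not HTDig's actual code; the function name links_to_follow, the example.com URLs, and the sample page are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_follow(page_url, html, allowed_host):
    """Return absolute, de-duplicated links from html that stay on allowed_host."""
    parser = LinkExtractor()
    parser.feed(html)
    result = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).hostname == allowed_host and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical page content, used only for illustration.
sample = (
    '<a href="/articles/1.html">One</a>'
    '<a href="http://www.example.org/">Elsewhere</a>'
    '<a href="about.html">About</a>'
)
print(links_to_follow("http://www.example.com/index.html", sample, "www.example.com"))
```

A real crawler repeats this step for every URL it has not yet visited, which is why the host restriction matters: without it, the list of pages to fetch keeps growing.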
Once all of the data’s been gathered, the htmerge process combines the data into a document database that htsearch can use. To facilitate more advanced searching, the htfuzzy program allows you to build a “fuzzy” index using a variety of algorithms, including soundex, metaphone, endings, synonyms, and more.
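If you haven't run into soundex before, the idea is to reduce each word to a short code so that words that sound alike end up with the same code. The sketch below implements the classic four-character American Soundex in Python. It is meant only to show the principle; the codes htfuzzy actually stores may well differ in length and detail.

```python
def soundex(word):
    """Return the classic four-character American Soundex code for word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(letter, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to four characters


for word in ("Robert", "Rupert", "searching", "serching"):
    print(word, soundex(word))
```

With an index keyed on codes like these, a misspelled query such as “serching” can still find pages that contain “searching.”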
Enough theory. Let’s install HTDig and get going. Let’s start by downloading and unpacking the latest stable version:
Depending on your web server configuration, you may want to set the variables to other directories or use Alias directives in your Apache configuration to make them work.
With all of that out of the way, build and install HTDig:
$ sudo make install
With that done, you’ll have all the necessary files in subdirectories of /htdig/ (or wherever you chose to install it):
$ ls -F /htdig/
bin/ cgi-bin/ common/ conf/ db/ htdocs/
To configure HTDig, edit the htdig.conf file. This file controls how the various HTDig tools work, where they find data, and so on. Let’s look at the values you’ll likely need to adjust.
database_dir: This is /htdig/db/ by default. HTDig stores all of its data files here. You’ll need to change the location if you don’t have sufficient space on the filesystem.
start_url: The htdig crawler starts here. To index the Linux Magazine web site, you’d set this to http://www.linux-mag.com/. In other words, set start_url to the top-most page of your site unless you only want to index a subset of your site.
exclude_urls: By default, this contains /cgi-bin/ .cgi. It’s a list of space-separated patterns that htdig will attempt to match every URL against. If a URL matches, htdig does not fetch it. Using this, you can easily keep htdig out of a dynamic area of your site, one that may generate thousands of different but virtually identical pages.
maintainer: This is the string used to identify the crawler. Change this to your email address.
Most of the other values are either parameters that you can tweak after getting familiar with HTDig, or they’re variables that control the HTDig user interface.
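To get a feel for what exclude_urls does, here is a small Python sketch of the filtering. It assumes the patterns are treated as plain substrings of the URL, which is how the default value reads; check the HTDig documentation for the exact matching rules in your version. The helper name is_excluded and the example.com URLs are invented for the illustration.

```python
def is_excluded(url, exclude_urls):
    """Return True if url contains any of the space-separated patterns."""
    return any(pattern in url for pattern in exclude_urls.split())


# The default value described above: two patterns separated by a space.
exclude_urls = "/cgi-bin/ .cgi"

for url in (
    "http://www.example.com/index.html",
    "http://www.example.com/cgi-bin/calendar?month=5",
    "http://www.example.com/tools/report.cgi",
):
    print(url, "skip" if is_excluded(url, exclude_urls) else "fetch")
```

Under that assumption, adding a pattern such as /calendar/ to the list would keep the crawler out of every URL containing that path.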
After you’ve edited the configuration file, it’s time to start crawling and indexing. The rundig wrapper script does just that. It takes care of invoking htdig, htmerge, htnotify, and then htfuzzy.
The first run normally produces no output, and it can take quite a while if you have a lot of data to index. The spidering process is network-intensive, while the merging process can be CPU- and disk-intensive, so expect a large site to use a fair amount of disk space as well.
Indexing the Linux Magazine site took roughly twenty minutes with a fast network connection. The resulting index used approximately 60 MB of disk space. The entire Linux Magazine site contains 200 MB of data, but a good portion of that is downloads and images.
After the index has been built, it’s time to take it for a test drive. The first thing to do is try htsearch at the command line. To do so, run the htsearch program. It’ll prompt you for words and a format. To test the Linux Magazine index, let’s search for “install” and leave the format blank, since it’s not terribly useful on the command line.
Enter value for words: install
Enter value for format:
htsearch should spit out a bunch of search results embedded in HTML. If that’s not what you see, double check the installation and the values in your configuration file.
Figure One: An htdig search using the default interface
If that works, you should be able to point your browser at htsearch and run a test search. The results will look something like Figure One.
The only other back-end task to worry about is keeping the index up-to-date. The easiest way to accomplish that is to set up periodic execution of rundig using something like the cron daemon. A crontab entry like this…
00 07 * * 6 /htdig/bin/rundig
…ensures that the index is updated every Saturday at 7:00 am.
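If the five scheduling fields look cryptic, this short Python sketch labels them for the entry above. It only handles plain values like the ones shown (no ranges, lists, or step values), and the function name describe_crontab_entry is made up for the example.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")
WEEKDAYS = ("Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday")


def describe_crontab_entry(entry):
    """Split a crontab line into its five schedule fields plus the command."""
    parts = entry.split(None, 5)  # five schedule fields, then the command
    if len(parts) != 6:
        raise ValueError("expected five schedule fields and a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


schedule = describe_crontab_entry("00 07 * * 6 /htdig/bin/rundig")
print(schedule)
# In the day-of-week field, 0 and 7 both mean Sunday, so 6 is Saturday.
print(WEEKDAYS[int(schedule["day_of_week"]) % 7])
```

Changing the last schedule field to an asterisk would run rundig every day at 7:00 am instead, which may suit a site that changes often.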
With all the mechanics in place, you can turn to the user interface. There are two tasks to contend with. The first is integrating the search into an existing site. Then you can look at customizing the results page.
By looking at the HTML source code for the search page, you can extract the minimal elements required to add a simple search box on your site:
Yes, it’s that easy. In fact, you can leave off the submit button if you’d like. It’ll save some space if you already have a rather tight layout. Of course, you may need to add some <div> tags and the right CSS to make it look just right on your site.
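A search box like this ultimately just sends the visitor's words to htsearch. The Python sketch below builds the kind of URL such a request might use, which can be handy for testing or for creating canned search links. It assumes that htsearch is reachable at /cgi-bin/htsearch on your server and that it accepts the query in a parameter named words (the same name it prompts for on the command line) on the query string; adjust both to match your installation. The function name build_search_url and the example.com host are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(htsearch_url, words):
    """Return a URL that asks htsearch to search for the given words."""
    return htsearch_url + "?" + urlencode({"words": words})


# Assumed location of the htsearch CGI; yours may differ.
print(build_search_url("http://www.example.com/cgi-bin/htsearch", "install linux"))
```

Using urlencode rather than pasting the words into the URL by hand takes care of spaces and other characters that need escaping.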
To control what the results page looks like, there are two places to change. The first is the htdig.conf file. It contains several variables that have names ending with _text. Those variables control the appearance of the navigation among pages of results. You can replace the default images with your own or maybe just use text links.
To control the larger aspects of the results page, you’ll need to modify the various HTML files in HTDig’s common directory, primarily header.html and footer.html. Those two files make up the look and feel of the results page. Similarly, the long.html and short.html files control the appearance of the long and short result formats, respectively. There’s also a nomatch.html for queries that produce no results and a syntax.html page that provides search syntax help.
By adjusting the colors, adding your site’s logo, and making a few other changes, you can easily customize HTDig’s output so that it feels like an integral part of your site.
All in all, HTDig is a surprisingly powerful search engine for small- and medium-sized web sites. Be sure to check out the on-line documentation for information about performance tuning and advanced features.
Until next month, happy searching!
Jeremy Zawodny plays with MySQL by day and spends his spare time flying gliders in California and writing a MySQL book for O’Reilly & Associates. You can reach Jeremy via email at Jeremy@Zawodny.com.