In a Nutshell
* Scans multiple Web sites
* Flexible search engine
* Installation requires building
* Configuration steps required
[Figure: Find it Fast: ht://Dig in action on an intranet site.]
Because this is free software, there are no official minimum system requirements. However, Linux Magazine recommends:
* Intel 486 or better
* 8 MB RAM
* 100 MB Disk Space (allow 12K per document indexed)
* Linux Kernel version 2.0 or higher
* The C and C++ compilers*
* GNU Make*
* The libstdc++ libraries
* These tools are needed only to build the program; the author does not provide pre-built binaries for several platforms, including glibc Linux, so building from source is often required.
It may not get a lot of respect, but that little search box you see on most Web sites is one of the killer apps of the Internet. And in the case of sites like Yahoo, Google, and AltaVista, it has become the cornerstone of billion-dollar companies. Well, you may not be able to turn your Web site into the next Yahoo, but if you’re looking for a great way to index all the content on your own Web site, Andrew Scherpbier’s ht://Dig may be for you.
ht://Dig’s Web interface is easy to use, although installation and configuration are a bit more demanding. Its price (free) is certainly right, which helps explain why it has already found a home on sites run by universities and businesses.
Digging into ht://Dig
ht://Dig consists of three parts: a backend indexing engine that scans Web pages and stores the information in a database; a search engine that queries that database; and HTML files that provide end-user access to the search engine.
Getting ht://Dig up and running wasn’t difficult. We downloaded the source package and built the program using the standard C++ tools found in most major Linux distributions. The build process also creates an installation script that installs ht://Dig for use with Apache. We tested the program on Debian, Red Hat, and Caldera, each time with success. According to the ht://Dig Web site, the package also works with Sun Solaris, HP-UX, and IRIX.
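For readers who haven’t built from source before, the process follows the usual Unix configure-and-make sequence; the sketch below is illustrative (the archive name and install location are placeholders, not taken from the package):

```shell
# Typical source build; archive name and paths are placeholders.
tar xzf htdig.tar.gz
cd htdig
./configure        # standard autoconf configure script
make
make install       # the build also produces an install script for Apache
```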
Before running the backend script, called rundig, we had to edit the htdig.conf file to include the names or IP addresses of the Web servers to be scanned. ht://Dig will not scan FTP sites, but it can scan any local or remote server, and it can also use basic authentication with a specified username and password.
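A minimal htdig.conf along these lines might look as follows. The attribute names come from ht://Dig’s documented configuration format, but every value, host name, and path here is a placeholder, and the authorization line reflects our understanding of how the username and password are supplied:

```
# Minimal htdig.conf sketch; all values are placeholders.
database_dir:   /opt/htdig/db
start_url:      http://intranet.example.com/
limit_urls_to:  ${start_url}
exclude_urls:   /cgi-bin/ .cgi
# Basic authentication for protected areas (user:password):
authorization:  username:password
```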
The rundig script nicely automates the update process by scanning the specified Web sites and storing the information in the ht://Dig database. It can even check for expired pages and send e-mail notifications when it finds them. (The expiration date and e-mail address are set in the HTML pages using META tags.) The program also creates indices for fuzzy search methods, including soundex, metaphone, endings, and synonyms, any of which can be used in a search.
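The expiration date and notification address are embedded in the pages themselves; a sketch of the META tags, as we understand ht://Dig’s notification convention (the date and address are placeholders):

```html
<!-- ht://Dig notification tags; values are placeholders. -->
<meta name="htdig-notification-date" content="2001-06-30">
<meta name="htdig-email" content="webmaster@example.com">
```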
The size of the database is approximately 7.5 KB per document without a keyword list and 12 KB per document with one. This information includes excerpts from the document that can provide information about it even when it is inaccessible, such as when the Web server is down. The initial portion or all of the document can optionally be saved as well.
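Those per-document figures make capacity planning a matter of simple arithmetic. A quick back-of-the-envelope check against the recommended 100 MB of disk space:

```python
# Capacity estimate from the article's figures: ~7.5 KB per indexed
# document without a keyword list, ~12 KB with one.
def docs_that_fit(disk_mb, kb_per_doc):
    """Return how many documents fit in the given disk budget."""
    return int(disk_mb * 1024 // kb_per_doc)

# With the recommended 100 MB of disk space:
print(docs_that_fit(100, 12))   # → 8533 documents with keyword lists
print(docs_that_fit(100, 7.5))  # → 13653 documents without
```

So the recommended 100 MB comfortably covers a site of several thousand pages, with room left for the fuzzy search indices.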
As you’d expect, the time needed to complete a backend scan depends on the number of Web pages you want to index, the speed of the connection between the ht://Dig server and the Web site, and the number of fuzzy search indices that must be built. Scans can last from a minute or so to hours for sufficiently large and complex sites. The payback for all this is the quick database search, which takes just seconds.
ht://Dig adds a new dimension of usability and professionalism to almost any Web site, and at an unbeatable price. That means that virtually every Web site running on a Linux or Unix server can and should be indexed.