Last month, we looked at adding search to a site using the open source ht://Dig search tools. As you'll recall, ht://Dig handles the crawling, indexing, and search duties. However, not everyone has the access or resources required to install ht://Dig, so this month we'll try an alternative approach -- using Google from PHP.
It’s the API, Stupid
Google is built on one of the world’s largest deployments of Linux servers (somewhere around 10,000 machines). Not only is Google an amazingly powerful Web search engine, it’s also available as a web service. That means you can easily write code to query Google and manipulate the results programmatically, without having to resort to “screen scraping” tricks. (For more about web services, see the August 2002 issue, available online at http://www.linux-mag.com/2002-08/web_services_01.html.)
Since its public release, the Google web service has been put to a variety of strange and amusing uses. GoogleFight (http://www.googlefight.com) is one of the most entertaining. Here, our goal is to produce a Google clone that searches only your web site. To do that, we’ll need to develop the necessary PHP code to handle a simple form (a search box), query Google, and produce a list of matching documents.
To get started, there’s a bit of software and information that you’ll need to collect. First, pay a visit to Google’s API page (http://www.google.com/apis), download the developer’s kit, and create a Google account. The developer’s kit contains an API reference as well as sample code for Java, Visual Basic .NET, and C#. No PHP. However, with the API reference in hand, it’s not that difficult to get going. As part of the account creation process, you’ll receive a Google API key (a string of 30 characters or so) that must be included in every request you make.
Once you have the Google-supplied code and a key, head over to PEAR (http://pear.php.net) and fetch all of the following packages:
It’s important to install all of the components. Without all of them, your code may fail in mysterious ways, causing consternation, frustration, and a lot of lost time debugging.
Now Get Coding
To get a feel for what’s possible, let’s start with a PHP script that queries Google and dumps a bunch of data to standard output. Using the code in Listing One, you can query Google for an arbitrary term and look at the raw results.
Starting at the top of Listing One, we include the SOAP client code library. Then we supply the query string (including a site: restriction so Google will only return documents hosted on our site) and the key you received from Google after registering. Then the fun begins.
Listing One: How to call the Google search web service
// What we're searching for...
$query = 'linus torvalds site:www.linux-mag.com';
The bulk of the code is two SOAP-related calls. First, we instantiate a SOAP_Client object (called $s) and then call its call() method. In doing so, we’re making a SOAP call to Google to invoke the doGoogleSearch method with a list of parameters and a namespace qualifier thrown in at the end. (The API Reference details the meaning of each parameter.)
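As a minimal sketch of the two calls just described, assuming the PEAR SOAP package is on your include path: the helper names google_search_params() and google_search() are hypothetical, and the endpoint URL, the urn:GoogleSearch namespace, and the parameter names are assumptions that you should check against the API reference in your developer's kit.

```php
<?php
// Sketch only: helper names are hypothetical, and the endpoint, namespace,
// and parameter names are assumptions to verify against the API reference.

// Build the parameter list passed to doGoogleSearch.
function google_search_params($key, $query, $start = 0, $maxResults = 10)
{
    return array(
        'key'        => $key,         // your Google API key
        'q'          => $query,       // search terms, including site:
        'start'      => $start,       // zero-based index of the first result
        'maxResults' => $maxResults,  // number of results to return
        'filter'     => false,        // don't collapse similar results
        'restrict'   => '',           // no country or topic restriction
        'safeSearch' => false,
        'lr'         => '',           // no language restriction
        'ie'         => '',
        'oe'         => '',
    );
}

// Make the SOAP call. Requires the PEAR SOAP package.
function google_search($key, $query)
{
    require_once 'SOAP/Client.php';

    $s = new SOAP_Client('http://api.google.com/search/beta2');
    $params = google_search_params($key, $query);

    // The third argument is the namespace qualifier.
    return $s->call('doGoogleSearch', $params, 'urn:GoogleSearch');
}
```

With your own key in $key, print_r(google_search($key, $query)) would dump the raw response for inspection.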
When the result comes back, we use PHP’s print_r() to dump out the object structure. It looks something like this:
The output lists nine more elements, because maxResults was set to 10 in the request.
There’s a bit of interesting stuff at the beginning, notably the estimatedTotalResultsCount and searchTime, but what we’re really interested in is the resultElements array. It contains a list of objects for each of the matching documents. As you can see, we get the cached size, snippet, URL, title, and occasionally a lot more.
Make it Pretty
Now that we know how the data is structured, it’s relatively easy to walk through the results and produce HTML output suitable for a search results page. The code in Listing Two does just that.
We retrieve the estimated number of matches and then a list of the result objects. The code then iterates over the list, extracting the relevant properties to produce a string of HTML output.
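The loop just described might look roughly like the following sketch. It is an illustration rather than the column's own listing: render_results() is a hypothetical helper, the property names follow the result structure described earlier, and the $demo object is a hand-built stand-in so the code can run without a live query.

```php
<?php
// Sketch only: render_results() is a hypothetical helper, and the property
// names follow the result structure described in the text.

function render_results($result)
{
    $total = (int) $result->estimatedTotalResultsCount;
    $html  = "<p>Estimated matches: $total</p>\n<ol>\n";

    foreach ($result->resultElements as $hit) {
        $url = htmlspecialchars($hit->URL);

        // Assumption: titles and snippets arrive as HTML fragments (with the
        // search terms in bold), so they are passed through unescaped.
        // Untitled pages fall back to the URL.
        $title = ($hit->title !== '') ? $hit->title : $url;

        $html .= "<li><a href=\"$url\">$title</a><br />\n";
        $html .= "{$hit->snippet}<br />\n";
        $html .= "<small>$url - {$hit->cachedSize}</small></li>\n";
    }

    return $html . "</ol>\n";
}

// A hand-built stand-in for a real response, for illustration only.
$demo = (object) array(
    'estimatedTotalResultsCount' => 1,
    'resultElements' => array(
        (object) array(
            'URL'        => 'http://www.example.com/a?x=1&y=2',
            'title'      => 'An <b>example</b> page',
            'snippet'    => 'Some <b>example</b> text ...',
            'cachedSize' => '12k',
        ),
    ),
);

echo render_results($demo);
```

Returning a string instead of echoing inside the loop keeps the markup in one place, which makes it easier to wrap the results in your own header and footer.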
If you append the code in Listing Two to Listing One and remove the print_r() call, you’ll have a script that can produce HTML results from a Google query. Then all that’s left to do is add a header and footer to the script, maybe toss in a style sheet reference, and get it to accept query strings as part of the URL. With just a little bit of code, you’ve got your own site-specific search engine, thanks to the Google API.
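Accepting the query from the URL could be sketched as follows. The function name build_site_query() and the form field name q are illustrative choices, not part of the Google API, and stripping any site: operator the visitor types is one possible policy for keeping results on your own site.

```php
<?php
// Sketch only: build_site_query() and the "q" field name are illustrative
// choices, not part of the Google API.

function build_site_query($input, $site)
{
    // Remove any site: operator the visitor typed, then tidy whitespace.
    $terms = preg_replace('/\bsite:\S*/i', ' ', (string) $input);
    $terms = trim(preg_replace('/\s+/', ' ', $terms));

    // An empty string signals that there is nothing to search for.
    return ($terms === '') ? '' : "$terms site:$site";
}

$input = isset($_GET['q']) ? $_GET['q'] : '';
$query = build_site_query($input, 'www.linux-mag.com');

// Escape the visitor's text before echoing it back into the search box.
$box = htmlspecialchars($input);
echo "<form method=\"get\" action=\"\">\n";
echo "<input type=\"text\" name=\"q\" value=\"$box\" />\n";
echo "<input type=\"submit\" value=\"Search\" />\n";
echo "</form>\n";
```

When $query comes back empty, the script can simply show the form and skip the call to Google.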
Listing Three shows complete, yet minimal, code, including a search box so you can easily re-query. You’ll need to add your own artistic touch, however, to the otherwise plain output.
Listing Three: A PHP recipe to provide search on your web site
It turns out that search isn’t the only feature exposed in the Google API. By changing the method call and using different parameters, you can retrieve spelling suggestions, too. Google’s not the only company to offer its services via a SOAP interface, but its is probably the most widely used.
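A spelling request might be sketched like this, again assuming the PEAR SOAP package. The helper names are hypothetical, and the method name doSpellingSuggestion, its key and phrase parameters, the endpoint, and the namespace are assumptions to confirm in the API reference.

```php
<?php
// Sketch only: helper names are hypothetical; the method name, parameter
// names, endpoint, and namespace are assumptions to verify against the
// API reference.

// Build the parameter list for a spelling request.
function spelling_params($key, $phrase)
{
    return array('key' => $key, 'phrase' => $phrase);
}

// Ask Google for a suggested spelling of $phrase.
function google_spelling_suggestion($key, $phrase)
{
    require_once 'SOAP/Client.php';   // PEAR SOAP package

    $s = new SOAP_Client('http://api.google.com/search/beta2');
    $params = spelling_params($key, $phrase);

    return $s->call('doSpellingSuggestion', $params, 'urn:GoogleSearch');
}
```

Only the method name and the parameter list differ from a search request; the client object and namespace stay the same.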
Anyway, rather than fussing with search software running on your server, consider letting Google do all of the hard work for you. Just be sure to read the license terms when you sign up for your account. By spending a few minutes in the PHP code, you can adjust the appearance of the results in any way that you’d like.
Jeremy Zawodny plays with MySQL by day and spends his spare time flying gliders in California and writing a MySQL book for O’Reilly & Associates. Reach him at Jeremy@Zawodny.com. You can download the code used in this column from http://www.linux-mag.com/downloads/2003-07/lamp.