Adding Search to Your Site, Part 2

Last month, we looked at adding search to a site using the open source ht://Dig search tools. As you'll recall, ht://Dig handles the crawling, indexing, and search duties. However, not everyone has the access or resources required to install ht://Dig, so this month we'll try an alternative approach -- using Google from PHP.

It’s the API, Stupid

Google is built on one of the world’s largest deployments of Linux servers (somewhere around 10,000 machines). Not only is Google an amazingly powerful web search engine, it’s also available as a web service. That means you can easily write code to query Google and manipulate the results programmatically, without having to resort to “screen scraping” tricks. (For more about web services, see the August 2002 issue, available online at http://www.linux-mag.com/2002-08/web_services_01.html.)

Since its public release, the Google web service has been put to a variety of strange and amusing uses. GoogleFight (http://www.googlefight.com) is one of the most entertaining. Here, our goal is to produce a Google clone that searches only your web site. To do that, we’ll need to develop the necessary PHP code to handle a simple form (a search box), query Google, and produce a list of matching documents.

To get started, there’s a bit of software and information that you’ll need to collect. First, pay a visit to Google’s API page (http://www.google.com/apis), download the developer’s kit, and create a Google account. The developer’s kit contains an API reference as well as sample code for Java, Visual Basic .NET, and C#, but nothing for PHP. However, with the API reference in hand, it’s not that difficult to get going. As part of the account creation process, you’ll receive a Google API key (a string of 30 characters or so) that must be included in every request you make.

The PHPieces

Once you have the Google-supplied code and a key, head over to PEAR (http://pear.php.net) and fetch all of the following packages:

PEAR SOAP (http://pear.php.net/package-info.php?pacid=87)

HTTP_Request (http://pear.php.net/package-info.php?pacid=33)

Net_URL (http://pear.php.net/package-info.php?pacid=34)

Net_DIME (http://pear.php.net/package-info.php?pacid=86)

Once you’ve retrieved those, install each one into your local PEAR directory. If you’re not familiar with PEAR installation, see Chapter 2 of the PEAR manual (http://pear.php.net/manual/en/installation.php).

It’s important to install all of the components. Without all of them, your code may fail in mysterious ways, causing consternation, frustration, and a lot of lost time debugging.

Now Get Coding

To get a feel for what’s possible, let’s start with a PHP script that queries Google and dumps a bunch of data to standard output. Using the code in Listing One, you can query Google for an arbitrary term and look at the raw results.

Starting at the top of Listing One, we include the SOAP client code library. Then we supply the query string (including a site: restriction so Google will only return documents hosted on our site) and the key you received from Google after registering. Then the fun begins.




Listing One: How to call the Google search web service

<?php

include("SOAP/Client.php");

// What we're searching for...
$query = 'linus torvalds site:www.linux-mag.com';

// Google license key -- http://www.google.com/apis/
$key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

$s = new SOAP_Client('http://api.google.com/search/beta2');
$result = $s->call('doGoogleSearch', $params = array(
    'key' => $key,
    'q' => $query,
    'start' => 0,
    'maxResults' => 10,
    'filter' => false,
    'restrict' => '',
    'safeSearch' => false,
    'lr' => '',
    'ie' => '',
    'oe' => '',
), 'urn:GoogleSearch');

print_r($result);
?>

The bulk of the code is two SOAP-related calls. First, we instantiate a SOAP_Client object (called $s) and then invoke its call() method. In doing so, we’re making a SOAP request that asks Google to run the doGoogleSearch method with a list of parameters, with a namespace qualifier thrown in at the end. (The API Reference details the meaning of each parameter.)
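Since the same ten parameters go into every search, it can help to build them in one place. The following is only a sketch: build_search_params() and its $page and $perPage arguments are hypothetical names invented here, and the only thing taken from Listing One is the set of array keys. It assumes that start is a zero-based offset into the results, which is how Listing One uses it for the first page.

```php
<?php
// Hypothetical helper (not part of PEAR SOAP or the Google kit):
// builds the parameter array that Listing One passes to doGoogleSearch.
function build_search_params($key, $terms, $site, $page = 1, $perPage = 10)
{
    // Pages are 1-based here; 'start' is assumed to be a 0-based offset.
    $start = (max(1, (int) $page) - 1) * $perPage;

    return array(
        'key'        => $key,
        'q'          => trim($terms) . ' site:' . $site,
        'start'      => $start,
        'maxResults' => $perPage,
        'filter'     => false,
        'restrict'   => '',
        'safeSearch' => false,
        'lr'         => '',
        'ie'         => '',
        'oe'         => '',
    );
}

// Parameters for the second page of results on our site.
$params = build_search_params('xxxxxxxx', ' linus torvalds ', 'www.linux-mag.com', 2);
echo $params['q'], "\n";      // linus torvalds site:www.linux-mag.com
echo $params['start'], "\n";  // 10
```

The resulting array can be handed to call() in place of the inline array() in Listing One.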

When the result comes back, we use PHP’s print_r() to dump out the object structure. It looks something like this:


$ php search.php

X-Powered-By: PHP/4.1.2
Content-type: text/html

stdClass Object
(
    [documentFiltering] =>
    [estimatedTotalResultsCount] => 171
    [directoryCategories] =>
    [searchTime] => 0.029243
    [resultElements] => Array
        (
            [0] => stdClass Object
                (
                    [cachedSize] => 18k
                    [hostName] =>
                    [snippet] => <b>…</b> FEATURES The Great…
                    [directoryCategory] => stdClass Object
                        (
                            [specialEncoding] =>
                            [fullViewableName] =>
                        )

                    [relatedInformationPresent] => 1
                    [directoryTitle] =>
                    [summary] =>
                    [URL] => http://www.linux-mag.com/2002-12/linus_01.html
                    [title] => Linux Magazine | December 2002 | FEATURES…
                )

The output continues with nine more elements, because maxResults was set to 10 in the request.

There’s a bit of interesting stuff at the beginning, notably the estimatedTotalResultsCount and searchTime, but what we’re really interested in is the resultElements array. It contains one object for each matching document. As you can see, we get the cached size, snippet, URL, title, and occasionally a lot more.
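The two top-level values are handy for a summary line above the results. Here is a minimal sketch of that idea. summarize_result() is a hypothetical helper, and the $fake object is a hand-built stand-in that carries the two values from the sample output, so the code runs without contacting Google.

```php
<?php
// Hypothetical helper: one-line summary built from the top-level fields.
function summarize_result($result)
{
    $count   = (int) $result->estimatedTotalResultsCount;
    $seconds = (float) $result->searchTime;

    if ($count === 0) {
        return 'No matching documents.';
    }
    return sprintf('About %d matching documents (%.2f seconds)', $count, $seconds);
}

// Stand-in for a real response, using the values from the sample output.
$fake = new stdClass();
$fake->estimatedTotalResultsCount = 171;
$fake->searchTime = 0.029243;

echo summarize_result($fake), "\n";  // About 171 matching documents (0.03 seconds)
```

Because the count is an estimate, wording such as "About" is probably safer than promising an exact number.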

Make it Pretty

Now that we know how the data is structured, it’s relatively easy to walk through the results and produce HTML output suitable for a search results page. The code in Listing Two does just that.




Listing Two: Converting the raw data to HTML

$num = $result->estimatedTotalResultsCount;
$elements = $result->resultElements;
$out = '';
if ($num > 0)
{
    foreach ($elements as $item)
    {
        $size = $item->cachedSize;
        $title = $item->title;
        $url = $item->URL;
        $snippet = $item->snippet;

        // Append (.=) so each result is added to the output
        // instead of overwriting the previous one
        $out .= "<p><b>$title</b> - <a href=\"$url\">$url</a> ";
        $out .= "<small>[Size: $size]</small></p>";
        $out .= "\n<blockquote>$snippet</blockquote>\n\n";
    }
}

echo $out;

We retrieve the estimated number of matches and then a list of the result objects. The code then iterates over the list, extracting the relevant properties to produce a string of HTML output.

If you append the code in Listing Two to Listing One and remove the print_r() call, you’ll have a script that can produce HTML results from a Google query. Then all that’s left to do is add a header and footer to the script, maybe toss in a style sheet reference, and get it to accept query strings as part of the URL. With just a little bit of code, you’ve got your own site-specific search engine, thanks to the Google API.
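Accepting the query from the URL deserves a little care, since the q parameter may be missing or padded with stray whitespace. The sketch below shows one way to tidy it up before building the query. search_terms_from_request() is a hypothetical helper, and the array passed to it stands in for PHP's GET variables.

```php
<?php
// Hypothetical helper: pull the search terms out of a GET-style array.
function search_terms_from_request($get)
{
    if (!isset($get['q']) || !is_string($get['q'])) {
        return '';
    }
    // Collapse runs of whitespace into single spaces and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $get['q']));
}

// Simulated request; a real page would pass in its GET variables.
$terms = search_terms_from_request(array('q' => "  linus   torvalds \n"));

if ($terms === '') {
    // Nothing to search for, so there is no point in calling Google.
    echo "Please enter a search term.\n";
} else {
    echo "Searching for: $terms\n";  // Searching for: linus torvalds
}
```

Skipping the SOAP call when the terms are empty also avoids spending a request on a query that contains nothing but the site: restriction.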

Listing Three shows complete, yet minimalistic code, including a search box so you can easily re-query. You’ll need to add your own artistic touch, however, to the otherwise plain output.




Listing Three: A PHP recipe to provide search on your web site

<?php
// Read the search terms from the query string (empty if none were given)
$terms = isset($_GET['q']) ? $_GET['q'] : '';
?>
<html>
<head>
<title>Search Results for <?php echo htmlspecialchars($terms) ?></title>
</head>
<body>
<form method="get">
<p>Search again:<br />
<input name="q">
</p>
</form>
<?php

include("SOAP/Client.php");

// What we're searching for...
$query = "$terms site:www.linux-mag.com";

// Google license key -- http://www.google.com/apis/
$key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

$s = new SOAP_Client('http://api.google.com/search/beta2');
$result = $s->call('doGoogleSearch', $params = array(
    'key' => $key,
    'q' => $query,
    'start' => 0,
    'maxResults' => 10,
    'filter' => false,
    'restrict' => '',
    'safeSearch' => false,
    'lr' => '',
    'ie' => '',
    'oe' => '',
), 'urn:GoogleSearch');
# print_r($result);

// Is result a PEAR_Error (or a subclass of it)?
if (PEAR::isError($result))
{
    $message = $result->getMessage();
    $output = "<p>An error occurred: $message</p>";
}
else
{
    // No error
    $num = $result->estimatedTotalResultsCount;
    $elements = $result->resultElements;
    $list = '';
    if ($num > 0)
    {
        foreach ($elements as $item)
        {
            $size = $item->cachedSize;
            $title = $item->title;
            $url = $item->URL;
            $snippet = $item->snippet;

            $desc = "<p><b>$title</b> - <a href=\"$url\">$url</a> ";
            $desc .= "<small>[Size: $size]</small></p>";
            $desc .= "\n<blockquote>$snippet</blockquote>\n\n";
            $list .= $desc;
        }
    }
    $output = "$num results found:\n\n$list";
}
echo $output;
?>
</body>
</html>

But That’s Not All!

It turns out that search isn’t the only feature exposed in the Google API. By changing the method call and using different parameters, you can retrieve spelling suggestions, too. Google’s not the only company to offer their services via a SOAP interface, but they’re probably the most widely deployed.
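As a rough sketch of what a spelling request might look like, the code below wraps the call in a small function. It assumes the method is named doSpellingSuggestion and takes just a key and a phrase, so check your copy of the API reference before relying on that. suggest_spelling() and StubClient are hypothetical names, and the stub returns a canned answer so the example runs without a key or a network connection.

```php
<?php
// Hypothetical wrapper around the spelling-suggestion method.
// $client is anything with a call($method, $params, $namespace) method,
// such as the SOAP_Client object created in Listing One.
function suggest_spelling($client, $key, $phrase)
{
    // Assumed parameter names; confirm them against the API reference.
    $params = array(
        'key'    => $key,
        'phrase' => $phrase,
    );
    return $client->call('doSpellingSuggestion', $params, 'urn:GoogleSearch');
}

// Stub client so the sketch runs offline. It only records the method name.
class StubClient
{
    public $lastMethod = '';

    public function call($method, $params, $namespace)
    {
        $this->lastMethod = $method;
        return 'linus torvalds';  // canned answer, not real Google output
    }
}

$stub = new StubClient();
echo suggest_spelling($stub, 'xxxxxxxx', 'linus torvolds'), "\n";
```

To try it for real, pass the SOAP_Client object in place of the stub and run the same PEAR_Error check on the return value that Listing Three uses.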

Anyway, rather than fussing with search software running on your server, consider letting Google do all of the hard work for you. Just be sure to read the license terms when you sign up for your account. By spending a few minutes in the PHP code, you can adjust the appearance of the results in any way that you’d like.

Happy Googling!



Jeremy Zawodny plays with MySQL by day and spends his spare time flying gliders in California and writing a MySQL book for O’Reilly & Associates. Reach him at Jeremy@Zawodny.com. You can download the code used in this column from http://www.linux-mag.com/downloads/2003-07/lamp.
