Sphinx: Queries and APIs

Now it's time to get serious and look at writing some simple code that can query a running Sphinx index and take advantage of its advanced query features.

Two weeks ago, in Sphinx: Search Outside the Box, I took a high-level look at why Sphinx is a great choice for full-text indexing of large data sets. Last week, in Sphinx: Getting Practical, we dove into setting up Sphinx, building a simple index, and querying it from the command line.

Now it’s time to get serious and look at writing some simple code that can query a running Sphinx index and take advantage of its growing number of advanced query features. The Sphinx Documentation is obviously the definitive reference, but I hope to show just enough sample code that you realize how easy it is to start talking to a Sphinx server.

API Basics

There are Sphinx clients available in most popular languages. If you look in the api subdirectory of the source tree, you’ll find Ruby, Java, PHP, Python, and C (libsphinxclient). There’s a Perl module available (Sphinx-Search on CPAN) too. And if none of those are sufficient, the latest versions of Sphinx even support SQL-like queries issued via the MySQL protocol, so any MySQL client can talk to searchd directly. The TCP port is whatever you assign with a listen directive in the searchd configuration; 9306 is a common choice, since 3306 usually belongs to MySQL itself. Talk about an easy migration path from MySQL full-text!

Obviously the syntax differs from language to language, but the general approach for querying Sphinx is similar in all of them.

  1. create a sphinx client object
  2. set query options
  3. set query
  4. connect to sphinx server (if not connected)
  5. send query
  6. receive results
  7. close connection

Note that it’s possible to batch queries and send several at once. Doing so lets Sphinx work more efficiently when the queries share work that only needs to be done once. That’s rarely needed in traditional web deployments, but it can be useful in offline processing. Also, newer Sphinx releases support persistent connections. Not only do they reduce the fork() overhead (which can be substantial!) on the server side, they also reduce the TCP overhead and allow for higher throughput in high-volume situations. As a result, steps 4 and 7 may not always apply.
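
To make batching over a persistent connection more concrete, here is a minimal sketch. It assumes the SphinxClient class from sphinxapi.php in a release recent enough to provide Open(), Close(), AddQuery(), and RunQueries(). The helper name run_batch is made up for this example.

```php
<?php
// Hypothetical helper: send several queries to searchd in one round trip.
// $cl is expected to be a SphinxClient from sphinxapi.php that has already
// been pointed at the server with SetServer().
function run_batch($cl, array $queries, $index)
{
    // Open a persistent connection so the requests reuse one socket.
    if (!$cl->Open()) {
        return false;
    }

    // Queue the queries; nothing is sent until RunQueries() is called.
    foreach ($queries as $query) {
        $cl->AddQuery($query, $index);
    }

    // One network round trip. On success this returns one result set per
    // queued query, in the same order; on a general failure it returns false.
    $results = $cl->RunQueries();

    $cl->Close();
    return $results;
}
```

Calling run_batch($cl, array("hello world", "sphinx search"), "index1") should hand back two result sets. As far as I can tell, each result set carries its own error and warning entries, so it's worth checking them one by one rather than assuming the whole batch succeeded.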

Let’s look at a simple PHP example that connects to the Sphinx server running on localhost and searches for all documents that contain the phrase “hello world”. Sorting the matches by size is covered under Sorting Modes below.

require("sphinxapi.php");

$host = "localhost";
$port = 9312; // searchd's API port: 9312 by default (3312 before 0.9.9); match your sphinx.conf
$index = "index1"; // put your index name here
$query = "hello world";

$cl = new SphinxClient();
$cl->SetServer($host, $port);
$cl->SetMatchMode(SPH_MATCH_PHRASE);

$res = $cl->Query($query, $index);

Once the results come back, you simply check for success and matches, printing out the document ids.

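if ($res === false) {
    print "failure: " . $cl->GetLastError() . "\n";
}
else {
    print "retrieved $res[total] of $res[total_found] matches in $res[time] seconds\n";

    // "matches" is absent when nothing matched; by default it is keyed by document id
    if (!empty($res["matches"])) {
        foreach ($res["matches"] as $id => $docinfo) {
            print "$id\n";
        }
    }
}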

That code makes use of the single file sphinxapi.php, which is the PHP client API shipped as part of every Sphinx release. In fact, the test suite used to validate new releases uses the PHP API heavily, so you can probably find example code to do just about anything you’d need.

As you can see, it follows the process outlined above. After a few variables are defined, we create a new Sphinx client object ($cl), set a few options, and then fire off the query. Iterating over the results is also very straightforward. The example above is intentionally short — it’s actually possible to retrieve some metadata (namely, the attributes) for each of the matched documents in the result set too.
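
As a rough sketch of reading that per-document metadata: in the PHP API's default mode, $res["matches"] is keyed by document id, and each entry holds a "weight" and an "attrs" array. The helper describe_matches and the sample data below are invented for illustration; size is the attribute used elsewhere in this article.

```php
<?php
// Hypothetical helper: turn a Sphinx result set into printable lines.
// Assumes the default result layout (SetArrayResult(true) was not called),
// where matches are keyed by document id, and scalar attribute values.
function describe_matches(array $res)
{
    $lines = array();
    foreach ($res["matches"] as $id => $match) {
        $pairs = array();
        foreach ($match["attrs"] as $name => $value) {
            $pairs[] = "$name=$value";
        }
        $weight = $match["weight"];
        $lines[] = "doc $id (weight $weight): " . implode(", ", $pairs);
    }
    return $lines;
}

// Made-up result set with the same shape the API returns.
$sample = array(
    "matches" => array(
        17 => array("weight" => 2, "attrs" => array("size" => 4096)),
        42 => array("weight" => 1, "attrs" => array("size" => 512)),
    ),
);

print implode("\n", describe_matches($sample)) . "\n";
```

With a real result, you would pass the array returned by Query() instead of $sample.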

Building on that simple foundation, there’s a lot more we can do.

Matching Modes

In the example code we called SetMatchMode(), passing SPH_MATCH_PHRASE. That told Sphinx we wanted a phrase match: find “hello” and “world” used together, in that order. There are several other matching modes available.

  • SPH_MATCH_ALL: find documents that contain all of the query terms
  • SPH_MATCH_ANY: find documents that contain any of the query terms
  • SPH_MATCH_BOOLEAN: allow AND (&), OR (|), and negation (-term) expressions plus grouping using parentheses
  • SPH_MATCH_EXTENDED2: support queries using Sphinx’s more complex query language
  • SPH_MATCH_FULLSCAN: search all documents, applying any specified filters and grouping

Between Boolean and extended2 (which replaces the original “extended” mode), you can construct queries complex enough for just about any circumstance.
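
To see how the three simplest modes differ, here is a toy model in plain PHP. It is only an illustration of the matching rules described above, not how Sphinx is implemented: the tokenizer is a crude lowercase-and-split, and all function names are made up.

```php
<?php
// Toy model only: lowercase the text and split on non-alphanumerics.
// Sphinx's real tokenizer, stemming, and charset handling do far more.
function tokens($text)
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// SPH_MATCH_ALL: every query term must appear somewhere in the document.
function match_all($query, $doc)
{
    return count(array_diff(tokens($query), tokens($doc))) === 0;
}

// SPH_MATCH_ANY: at least one query term must appear.
function match_any($query, $doc)
{
    return count(array_intersect(tokens($query), tokens($doc))) > 0;
}

// SPH_MATCH_PHRASE: the terms must appear next to each other, in order.
function match_phrase($query, $doc)
{
    $q = tokens($query);
    $d = tokens($doc);
    $n = count($q);
    for ($i = 0; $n > 0 && $i + $n <= count($d); $i++) {
        if (array_slice($d, $i, $n) === $q) {
            return true;
        }
    }
    return false;
}

$doc = "Hello, big wide world";
var_dump(match_all("hello world", $doc));    // bool(true)
var_dump(match_any("hello sphinx", $doc));   // bool(true)
var_dump(match_phrase("hello world", $doc)); // bool(false)
```

The same document satisfies the ALL and ANY queries but fails the phrase query, because other words sit between “hello” and “world”.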

Sorting Modes

Sphinx allows you to choose from several sorting modes that affect the order in which results are returned but not which documents match the query.

  • SPH_SORT_RELEVANCE: Sphinx default, sort from most relevant to least based on word frequency
  • SPH_SORT_ATTR_ASC: sort in ascending order based on an attribute
  • SPH_SORT_ATTR_DESC: sort in descending order based on an attribute
  • SPH_SORT_TIME_SEGMENTS: group by “time segment”, then sort by relevance within the groups
  • SPH_SORT_EXTENDED: configure sorting based on multiple attributes, each of which can be in ascending or descending order
  • SPH_SORT_EXPR: sort based on an arbitrary mathematical expression

To make this more concrete, consider this call:

$cl->SetSortMode(SPH_SORT_ATTR_DESC, "size");

That asks Sphinx to sort the documents from largest to smallest (based on the size attribute included in the earlier index definition).

In “extended” mode, you can sort on the attributes defined for your index as well as some of Sphinx’s internal attributes.

$cl->SetSortMode(SPH_SORT_EXTENDED, "size ASC, @id DESC");

That tells Sphinx to sort in ascending order by size and then in descending order by document id in the case of a tie. Extended mode is very powerful, especially if you have numerous attributes on which to sort (price, weight, date added, etc.).
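
If the tie-breaking rule is hard to picture, the ordering that "size ASC, @id DESC" asks for can be reproduced in plain PHP with usort(). This is just a sketch on made-up documents to show the resulting order; Sphinx does the sorting itself on the server.

```php
<?php
// Made-up documents: a document id plus the "size" attribute.
$docs = array(
    array("id" => 1, "size" => 300),
    array("id" => 2, "size" => 100),
    array("id" => 3, "size" => 300),
    array("id" => 4, "size" => 100),
);

// Equivalent of "size ASC, @id DESC": compare size first, and fall back
// to the document id, reversed, only when the sizes are equal.
function by_size_then_id_desc($a, $b)
{
    if ($a["size"] != $b["size"]) {
        return $a["size"] < $b["size"] ? -1 : 1;
    }
    if ($a["id"] == $b["id"]) {
        return 0;
    }
    return $a["id"] > $b["id"] ? -1 : 1;
}

usort($docs, "by_size_then_id_desc");

foreach ($docs as $doc) {
    print "{$doc['id']} ";
}
print "\n"; // prints: 4 2 3 1
```

The two size-100 documents come first, and within each size the higher id wins.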

Filtering

In addition to full-text search capabilities, Sphinx lets you use numeric attributes to refine a search. For example, in building a product search, you may want to find all products whose price is less than $500. Or maybe find all those that fall between $50 and $75. To do this, you’ll want to call SetFilter(), SetFilterRange(), or SetFilterFloatRange(). All three filtering functions allow you to specify either an inclusive or exclusive filter.

SetFilter() finds documents whose attribute matches any of one or more values, or excludes documents that match any of them.

$cl->SetFilter("price", array(100), false);    // find $100 items
$cl->SetFilter("price", array(50, 75), true);  // exclude $50 and $75 items

Similarly, we can use SetFilterRange() to find or exclude a range of integer values (use SetFilterFloatRange() for non-integer values).

$cl->SetFilterRange("price", 50, 100, 0); // find items priced between $50 and $100
$cl->SetFilterRange("price", 50, 100, 1); // exclude items priced between $50 and $100

Between filters on attributes and the extended query language, you can handle a surprising array of query types without having to write a lot of custom code.
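
One detail worth sketching: filters accumulate on the client object and are combined with AND, so the find and exclude calls shown above are alternatives rather than a sequence. The hypothetical helper below (search_by_price is my name, not part of the API) clears old filters before applying a range. It assumes a SphinxClient from sphinxapi.php and an integer price attribute holding whole dollars.

```php
<?php
// Hypothetical helper: full-text search restricted to a price range.
// $cl is expected to be a SphinxClient from sphinxapi.php; "price" is
// assumed to be an integer attribute in the index.
function search_by_price($cl, $query, $index, $min, $max)
{
    // Filters pile up on the client and are ANDed together, so clear
    // anything left over from a previous search first.
    $cl->ResetFilters();

    // Both bounds are inclusive: $min <= price <= $max.
    $cl->SetFilterRange("price", $min, $max);

    return $cl->Query($query, $index);
}
```

Because the bounds are inclusive, finding products under $500 would look something like search_by_price($cl, "laptop", "products", 0, 499), where the query and index name are placeholders.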

Geography

A special case of filtering and sorting based on attributes is geo-distance. If you have geocoded data, such as houses for sale or the locations of restaurants, you can add latitude and longitude attributes to your index and take advantage of Sphinx’s built-in support. In the SPH_SORT_EXPR sorting mode, you can use the built-in GEODIST() function to compute the distance between two points of latitude and longitude. But it’s easier to use the SetGeoAnchor() call to tell Sphinx what the latitude and longitude attributes are called in your index and specify an “anchor” point from which distances will be computed.

$cl->SetGeoAnchor("lat", "lon", $latitude, $longitude);

Once that is done, you can use the magic attribute @geodist in both filters and sorting. That would allow you to, say, find all pizza places within a 5-mile radius of a given point and then sort the result set based on that distance.
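
Here is a minimal sketch of that pizza-radius search. Two gotchas to keep in mind: Sphinx expects latitude and longitude in radians (both the anchor and the stored attributes), and it reports @geodist in meters. The helper name search_nearby is invented, and the code assumes sphinxapi.php has been loaded so that SphinxClient and SPH_SORT_EXTENDED exist.

```php
<?php
// Hypothetical helper: limit a search to a radius around a point and sort
// nearest-first. $cl is expected to be a SphinxClient from sphinxapi.php,
// which also defines SPH_SORT_EXTENDED. Assumes the index stores "lat" and
// "lon" as float attributes in radians.
function search_nearby($cl, $query, $index, $lat_deg, $lon_deg, $miles)
{
    // Convert the anchor from degrees to radians before handing it over.
    $cl->SetGeoAnchor("lat", "lon", deg2rad($lat_deg), deg2rad($lon_deg));

    // @geodist is measured in meters, so convert the radius from miles.
    $meters = $miles * 1609.344; // one international mile in meters
    $cl->SetFilterFloatRange("@geodist", 0.0, $meters);
    $cl->SetSortMode(SPH_SORT_EXTENDED, "@geodist ASC");

    return $cl->Query($query, $index);
}
```

A call such as search_nearby($cl, "pizza", "restaurants", 37.77, -122.42, 5) would then return matches within roughly five miles, closest first. The query, index name, and coordinates are placeholders.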

Conclusion

Hopefully this has provided you with some ideas for the types of tweaking you can do behind the scenes to make Sphinx search just the way you expect (and need) it to. In addition to everything we’ve seen so far, Sphinx can also perform more complex grouping of results and build “excerpts” of matched documents on the fly to show context (much like Google does). As always, it’s best to check the documentation for complete descriptions of the options as well as any gotchas or hints.

Happy searching!
