Sphinx: Search Outside the Box

Looking for ways to overcome indexing bottlenecks at Craigslist led to an investigation of Sphinx, a powerful, free full-text search engine that works extremely well with MySQL.

For many years, the only easy way to add decent search capability to a MySQL-backed web site was to use its full-text index support. It was fast, efficient, and reasonably configurable. But as more and more sites began to deal with larger datasets and moved from MyISAM to InnoDB, they found that it was harder to support their search needs. InnoDB does not provide full-text indexing, so that often meant keeping around a set of slaves that still ran MyISAM for the sole purpose of handling full-text search requests.

That’s exactly what we were doing at Craigslist before I began looking for a better way. My search for alternatives wasn’t motivated so much by the need to maintain the MyISAM slaves; that part was automated. The real problem was that we kept running into an invisible “glass ceiling” in performance. Low concurrency on the full-text searches eventually meant throwing far more hardware at the problem than was really necessary. Response times would go through the roof even though CPU was not being overly taxed, disks were mostly idle, and there was still free memory.

As many DBAs have discovered, MyISAM does not scale well in a multi-core world. Even the multiple key caches, which were added to partially address this problem, are a fairly coarse workaround.

Enter Sphinx

Sphinx was a project born out of the need for something better than MySQL’s full-text search for sites already using MySQL. It’s available both as a MySQL storage engine and as a stand-alone daemon. There are pros and cons to both deployment methods, but I’m going to focus mainly on the stand-alone daemon for this article. Not only is it what I’m most familiar with, I believe it to be far more powerful and flexible.

In fact, our migration from MySQL full-text to Sphinx has made a world of difference. We handle significantly higher query volumes with fewer machines, scale more smoothly, and have a lot more features to take advantage of. So, what makes Sphinx so good? For the next few weeks, I’ll dig into Sphinx to help you understand what makes it so powerful, how it works, and how to put it to work.

Architecture

Sphinx is designed to scale horizontally and take advantage of all available CPU power if needed. When Sphinx is started, a single searchd process optionally pre-loads some index data structures, binds to TCP port 3312 (the Sphinx default) and awaits connections. When a connection comes in, searchd forks a child to handle the request and goes back to waiting for new connections. The child takes the request, performs the search, and returns the results. If the client asked for a persistent connection, the searchd child awaits the next request. Otherwise it terminates.
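The accept-and-fork model described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not searchd’s actual code: the function names (make_listener, serve_one, handle_request) and the fake “results for” reply are made up for this example, and it assumes a Unix-like system where os.fork is available.

```python
import os
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket (port 0 lets the OS pick a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(16)
    return listener


def handle_request(conn: socket.socket) -> None:
    """Child-side work: read one query and send back a placeholder result."""
    query = conn.recv(1024).decode("utf-8").strip()
    conn.sendall(f"results for: {query}\n".encode("utf-8"))


def serve_one(listener: socket.socket) -> int:
    """Accept one connection, fork a child to handle it, return the child's pid."""
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: owns this one connection and shares no state with siblings.
        try:
            listener.close()
            handle_request(conn)
        finally:
            conn.close()
            os._exit(0)
    # Parent: drop its copy of the connection and go back to accepting.
    conn.close()
    return pid
```

A real daemon would call serve_one in a loop and reap finished children; the point here is only that each request runs in its own process, so a crash in one child cannot take down the others.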

This is the sort of “shared nothing” architecture you see in high-volume Apache/PHP or Apache/mod_perl applications. There’s no locking among the searchd processes, so it’s trivial to scale the number of concurrent requests you can process by making more CPU cores available. This design also means that a fatal bug in searchd will only affect the current process (and request) without others noticing. In other words, it’s a model that provides for excellent isolation too.

If you need to scale beyond a single machine, either due to query volume or data size, Sphinx makes that easy using distributed indexes. You can add some trivial horizontal partitioning in your application to spread the indexes among Sphinx servers hosted on multiple machines, which are then referred to as “agents.” Then you configure a Sphinx instance that knows about all of the agents. That Sphinx instance is what you’ll query from your application. It will, in turn, send the query to the individual agents in parallel, wait for the results to come back, combine them, and return them to you. From the application point of view, you don’t need to know if the indexes are located on a single machine or a dozen remote agents.
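The fan-out-and-merge step can be sketched as follows. This is a toy model under stated assumptions, not the Sphinx implementation: each “agent” is just a Python callable standing in for a remote searchd, the name search_distributed is hypothetical, and a hit is assumed to be a (document id, weight) pair.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Hit = Tuple[int, float]               # (document id, weight)
Agent = Callable[[str], List[Hit]]    # stand-in for a remote search server


def search_distributed(query: str, agents: List[Agent], limit: int = 10) -> List[Hit]:
    """Send the query to every agent in parallel, then merge and rank the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        partial_results = list(pool.map(lambda agent: agent(query), agents))
    merged = [hit for hits in partial_results for hit in hits]
    # Highest weight first; ties broken by document id for a stable order.
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:limit]
```

The caller sees a single ranked list and never needs to know how many agents contributed to it, which is the property the paragraph above describes.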

Of course, you can use your own vertical partitioning scheme too.

Performance

When I describe Sphinx performance, I often feel like I’m back in 2001 talking about MySQL. Sphinx is very, very fast. In fact, I think most people will be shocked by how well it performs, not just in handling searches but also when building or rebuilding its indexes. Part of the reason for this is the relentless focus on performance by Andrew Aksyonoff, Sphinx’s author. The Sphinx core is very efficient C++ that results in small, CPU-cache friendly binaries.

Sphinx works to pack as much data as possible into a small space, making every bit count. In fact, you can control the exact number of bits used to represent attributes of the documents that you index.
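To give a feel for what choosing a bit width per attribute buys you, here is a minimal bit-packing sketch. It does not reproduce Sphinx’s on-disk format; the helper names (pack_attrs, unpack_attrs) and the example attributes and widths are assumptions made for illustration.

```python
from typing import Dict


def pack_attrs(values: Dict[str, int], widths: Dict[str, int]) -> int:
    """Pack several small unsigned integers into one integer, field by field."""
    packed = 0
    shift = 0
    for name, bits in widths.items():
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        packed |= value << shift
        shift += bits
    return packed


def unpack_attrs(packed: int, widths: Dict[str, int]) -> Dict[str, int]:
    """Reverse pack_attrs, reading fields back in the same order."""
    values = {}
    for name, bits in widths.items():
        values[name] = packed & ((1 << bits) - 1)
        packed >>= bits
    return values
```

With widths of 4, 1, and 7 bits, three attributes fit in 12 bits instead of three full-size integers, which is the kind of saving that adds up across millions of documents.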

Another reason that Sphinx is so fast has to do with philosophy. Sphinx is fairly minimalistic in its approach. While it is configurable, it does not have a bewildering set of configuration options and dynamic on-the-fly options. Sphinx encourages you to sit down, define what you need, configure and test, and then deploy a highly efficient solution.

Under the Hood

At a high level, Sphinx provides very fast full-text search across multiple fields with custom ranking and filtering of the result set based on attributes. It supports common Boolean operators, proximity, wildcards, stemming, custom character sets, sorting, and more. But what does all that really mean?

Let’s say you’re building a Sphinx index of your email archive. You’d probably define an index with several text fields, just like you’d have columns in a database: subject, body, to, from, and cc. Then you’d define a number of attributes as well: items you wish to use in filtering and/or sorting but that you don’t need to perform text searches on, such as date, size, and attachment_count. Then you’d feed all the documents (messages) to the Sphinx indexer, which takes care of breaking them into tokens and attributes, building the indexes, and any stemming, filtering, or transformation you need done.
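The split between fields and attributes is easier to see in code. The sketch below is plain Python, not the Sphinx API: it assumes a text search has already produced a list of document ids, and it uses a hypothetical filter_and_sort helper to apply an attribute filter and an attribute sort to them.

```python
from typing import Dict, Iterable, List


def filter_and_sort(
    doc_ids: Iterable[int],
    attrs: Dict[int, Dict[str, int]],
    min_size: int = 0,
    sort_by: str = "date",
) -> List[int]:
    """Keep documents whose size attribute is large enough, newest/largest first."""
    kept = [doc_id for doc_id in doc_ids if attrs[doc_id]["size"] >= min_size]
    # Attributes are plain numbers, so sorting never touches the message text.
    return sorted(kept, key=lambda doc_id: attrs[doc_id][sort_by], reverse=True)
```

For example, with per-message attributes such as date, size, and attachment_count, you could ask for only messages of at least 1000 bytes, ordered by date, without searching any text at all.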

By default, you might build a simple Google-like interface to your email archive. Type in a term or two like “finance spreadsheet”, hit the search button, and get back a list of highly relevant results. To do that behind the scenes, you’d issue an ANY query across all the fields (subject, body, to, from, and cc). Sphinx would find all of the documents (messages) that match any of those terms in any of those fields, weight and sort them, and return the list of document ids (probably message ids in this example).

You have the flexibility to ask Sphinx to do the matching, weighting, sorting, and grouping in a number of different ways. The defaults are often sufficient for many applications.
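As a rough picture of what an ANY match does, here is a toy inverted index in Python. It is a deliberately simplified sketch: the tokenizer just lowercases and splits on whitespace, the weight is simply the number of distinct query terms a message contains, and the function names are made up for this example. Sphinx’s real tokenizing and ranking are far more sophisticated.

```python
from collections import defaultdict
from typing import Dict, List, Set

FIELDS = ("subject", "body", "to", "from", "cc")


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(messages: Dict[int, Dict[str, str]]) -> Dict[str, Set[int]]:
    """Map each token to the set of message ids containing it in any field."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, message in messages.items():
        for field in FIELDS:
            for token in tokenize(message.get(field, "")):
                index[token].add(doc_id)
    return index


def match_any(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of messages matching any query term, best matches first."""
    scores: Dict[int, int] = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # More matching terms rank higher; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A message containing both “finance” and “spreadsheet” ranks above one containing only “spreadsheet”, and messages with neither term are not returned.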

Final Thoughts

Next week I’ll walk through a simple Sphinx setup and demonstrate some features of the client API as well.

If you’re looking for a high-performance, stable, and very functional alternative to MySQL’s full-text search, give Sphinx a look. Odds are that it can do virtually everything you need and use fewer resources in the process.

Sphinx isn’t the only game in town. Apache Solr packages up the venerable Lucene search system into a very usable and highly configurable package that can run under a variety of application servers. Solr is also quite fast and is extensible via custom Java code. It has much better out-of-the-box support for what’s known as “faceted search,” the sort of drill-down into categories and features you see at popular shopping web sites like NewEgg. The only downsides I saw were a steeper learning curve (mainly complexity and configuration) and the need for a bit more infrastructure if you’re not already a Java shop. Solr is definitely worth looking at.

Comments on "Sphinx: Search Outside the Box"

paulsterne

Mailspect, the developer of MPP Email Archive, is a big fan of Sphinx. It is the open source component that we use to achieve ultra-fast search and retrieval of archived emails. We love it.

alexksi

If you are looking to perform faceted search with Sphinx, fSphinx is pretty cool:

https://github.com/alexksikes/fSphinx
