Alternatives to full text queries (part II)
For part one, click here…
What do they have in common and what makes them different?
Even though it’s hard to come up with a comparison table between all four alternatives, mainly because I can’t claim to have personal experience with all of them, the Internet has a lot of information on the subject, so I went ahead and did a bit of research on the matter.
Another point to consider is that although, in the long run, all four solutions provide very similar services, they go about it a bit differently, and they can be split into two categories:
- Full text search servers: They provide a finished solution, ready for developers to install and interact with. You don’t have to integrate them into your application; you only have to interact with them. Here we have Solr and Sphinx.
- Full text search APIs: They provide the functionality the developer needs, but at a lower level. You’ll need to integrate these APIs into your application, instead of just consuming their services through a standard interface (as happens with the servers). Here we have the Lucene project and the Xapian project.
Taking all of this into account, we can now proceed into a more in-depth discussion about our options:
Full text search servers
As I’ve discussed before, search servers provide a finished solution, ready to be installed, tweaked to your needs and interfaced with.
The installation and tweaking process can be hard, depending on your specific needs, but the bright side is that all of it is done outside your application and your code. Once you’re done and the server is set up the way you need it, the only step left is interacting with it from your application using one of the methods provided by the server.
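To make that last step a bit more concrete, here is a minimal sketch of how an application might address a search server over HTTP, using Solr’s select handler as the example. The host, the port and the core name `articles` are assumptions made up for illustration, and the helper function is my own. The code only builds the request URL; it never contacts a server.

```python
from urllib.parse import urlencode

def build_solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler (nothing is sent over the network)."""
    params = {
        "q": query,    # the search expression, in Solr's query syntax
        "rows": rows,  # maximum number of results to return
        "wt": "json",  # ask for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Hypothetical local server and core name.
url = build_solr_select_url("http://localhost:8983", "articles", "title:search")
print(url)
```

The point is that the application only needs to know how to build a request and read a response; all of the indexing machinery stays on the server side.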
So, what do they have in common?
- They’ll both satisfy your needs regarding searching and indexing speed, since they do it very efficiently.
- They both have a long list of high-traffic sites using them, as we’ve seen above.
- Both offer commercial support, which is great if you’re planning on developing a commercial application to use them.
- Both offer client API bindings for several platforms/languages.
- Both can be distributed to increase speed and capacity.
- They both have great support for advanced querying. This allows them to use proximity search, relevance sorting and so on.
Some of the differences
- Solr is an Apache project built on top of the Lucene project, so it benefits from improvements whenever Lucene is updated. Sphinx is a standalone project, so it has to evolve on its own.
- Solr supports wildcards in its searches, while the current release version of Sphinx (0.9) does not. The latest beta of Sphinx (2.0), though, appears to support this feature.
- Solr comes with spell check for search terms out of the box, while Sphinx does not.
- Solr can parse rich text formats like Word documents, PDF files and so on; Sphinx does not provide this feature out of the box.
- Sphinx integrates more tightly with RDBMSs, especially MySQL since it was built with this functionality in mind.
- Sphinx can use stopwords for more relevant result ranking by default, whereas Solr strips them out before doing the search.
- Sphinx supports SQL as a query language, whereas Solr forces you to learn its own query language.
- Sphinx does not support multithreading on Windows machines, which slows down its performance on that OS.
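The query language difference is easier to see side by side. The sketch below builds the same proximity search (two words within a few positions of each other) once as a Solr query string and once as a SphinxQL statement. The field name `body` and the index name `articles_idx` are invented for the example, and both helper functions are hypothetical; nothing here talks to a real server.

```python
def solr_proximity_query(field, phrase, distance):
    """Lucene/Solr query syntax: field:"some phrase"~N."""
    return f'{field}:"{phrase}"~{distance}'

def sphinxql_proximity_query(index, phrase, distance):
    """SphinxQL: a SQL-like SELECT whose full text part goes inside MATCH()."""
    # Illustration only: real code should escape user input or rely on
    # the parameter binding offered by a MySQL client library.
    return f"SELECT * FROM {index} WHERE MATCH('\"{phrase}\"~{distance}')"

print(solr_proximity_query("body", "full text", 5))
print(sphinxql_proximity_query("articles_idx", "full text", 5))
```

If your team already thinks in SQL, the second form will probably feel more familiar, which is the point the list above is making.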
So, which one is better?
As you might have guessed, there is no straight answer to this question, since they both do a similar job but have different strong and weak points. Each one ends up being the better choice in some specific cases.
On an out-of-the-box basis, for a generic search engine implementation, my recommendation (which should be taken lightly, since it’s only the opinion of a single developer) would be to go with Solr.
This conclusion is based on the features that Solr provides out of the box, some of which are quite important for a search engine, such as wildcard support, rich text format parsing, and so on.
On the other hand, if your needs are as specific as indexing content from a database, then Sphinx would be the way to go, since it appears to integrate natively with that kind of content, which should mean better performance than other solutions.
Full text search APIs
In this case, we’ll be comparing APIs, which could be thought of as “tools” to add to the “tool box” provided by the programming language you’re using.
Our two contestants are the Lucene project and the Xapian project. They both have quite a number of followers and applications that use them, so let’s see what else they have in common:
- They’re both highly portable. Lucene is written in Java, and thus it works on any Java-capable OS. Xapian, on the other hand, is written in C++ and supports most of the operating systems on the market.
- They both support rich text formats, which is definitely a plus if you’re trying to index any type of information (no need to pre-process them yourself).
- They both support advanced search mechanics, like wildcards, proximity search, stemming, and so on.
- They’re both able to add data into their index on a real-time basis, making information accessible immediately.
Basically, Lucene and Xapian are both very much alike: they’re both very powerful and very customizable. And yet, they have some differences, as we’ll see next.
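To give a feel for what integrating a search API into your application means, here is a deliberately tiny in-process inverted index. It is a toy written for this article and mimics neither Lucene’s nor Xapian’s actual API; the class and method names are invented. What it shares with them is the shape: the index lives inside your own program, and a document becomes searchable as soon as it has been added.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each lowercase word to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenization: split on whitespace and lowercase everything.
        for word in text.lower().split():
            self._postings[word].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every word of the query."""
        words = query.lower().split()
        if not words:
            return []
        # .get() avoids creating empty entries for unknown words.
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return sorted(result)

index = TinyIndex()
index.add(1, "Full text search with Xapian")
index.add(2, "Full text search with Lucene")
print(index.search("full text"))  # both documents match

index.add(3, "Lucene ports and bindings")
print(index.search("lucene"))     # document 3 is visible right after add()
```

The real libraries add everything this toy leaves out (stemming, ranking, wildcards, persistence and so on), but the way you call them from your own code follows the same add-then-search pattern.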
Some of the differences
- Lucene does not support faceted search, whilst Xapian does.
- Lucene does not support spelling corrections out of the box, whilst Xapian does.
- Xapian has support for synonyms out of the box; Lucene does not.
- Lucene has been ported to other programming languages (such as PHP, Ruby and C#, among others), providing Lucene-like APIs for those languages.
- Xapian provides bindings for other languages, but maintains a core API that’s always the same, ensuring that no matter which binding you’re using, you’re always working with the official distribution and not a poorly done copy.
So, which one is better?
In this case, and contrary to what one might expect, the answer (according to my research) is not the same as it was for Sphinx and Solr.
Going by our list of common and different features, one would be inclined to believe that Xapian is the way to go (at least I am!), since it comes out clearly ahead (on an out-of-the-box basis, of course).
So, why would we ever pick Lucene? Well, for starters, Lucene’s ports tend to integrate well with some of those languages’ most popular frameworks (in PHP, Zend Framework implements the Lucene search API; Ruby has its own implementation, called Ferret, which in turn has a RoR plug-in; and so on), which would ease the development process considerably. This could actually be a major point in favor of Lucene if we’re already using (or planning on using) one of those frameworks.
All in all, there are many solutions out there worth a try, even though I’ve only covered what appear to be the four best-known (or most used) options. There are others, like DataparkSearch, ht://Dig, mnoGoSearch, KinoSearch and many more, which might be the right pick for your needs.
What is really important to remember is that database engines are not the only solution out there for data handling, and that full text search solutions are not the silver bullet needed to kill the proverbial werewolf that represents our searching problems either.
We need to think about our needs very carefully before choosing a technology or we can end up having that proverbial wolf biting our rear…