When it comes to data storage and data handling, we developers, are quite used to working with database engines (MySQL, Oracle, SQLite, etc) and database query languages.
Depending on the application being developed, one of these solutions can be more than enough to meet our needs and that is what we usually end up using.
But there are times, when the amount of information to be handled is so big (we’re talking about millions of rows of information) and the needed response times are so low (we’re talking about a few milliseconds time) that we need solutions designed specifically for searching large amount of information instead of generic data handling ones.
What’s so good about them?
While most database engines work great with very dynamic loads of information (i.e.: where new information needs to be shown in real-time), and managing highly complex relational data, sometimes what we need is actually a solution that can search through a great amount of data in a very short time, giving the user a quick response.
The solutions we’ll be talking about here, do just that, they give the developers a way of accessing information at a striking speed, but at a price: the information used is not always the most updated one1. Another lose to consider here, is the fact that these searching solutions do not manage relational information like relational databases do. So you need to be careful when choosing one over the other, and consider exactly what your needs are.
How do they work?
The way they work, is, roughly, by indexing all the information, using some of several indexing mechanics (like the use of stop words2, stemming3, etc). Once the information has been indexed, it is there (on the index) where the search takes place, instead of searching directly on the unprocessed data.
Some of the most interesting features presented by these APIs are:
- Ranked search: The results can be returned ordered by the relevance of the match.
- Proximity search: Because sometimes, the searched terms are not written correctly, these APIs can return results that are close to the original ones.
- The use of wildcards: When searching, not always do we know the exact words, that’s when wildcards come into place, allowing us to specify a basic structure for our search and letting the search engine find the matching results.
Comparing some of them
Considering this scenario, today we’ll be discussing some alternative options for data handling and querying. I’ve chosen these projects based on general opinion on internet, according to most places, these guys are the most known/used alternatives out there (there are many others of course).
Our options are:
- Lucene: An Apache developed project, written entirely in Java.
- Xapian: An Open Source Search Engine Library, released under theGPL and written in C++.
- Sphinx: An open source full text search server with a high level of integration with database engines.
- Solr: Solr is a standalone enterprise search server with a REST-like API that uses Lucene as its base full-text search engine.
What am I looking for?
Before even comparing products, it is important to understand that there is rarely a single solution to our problems. That almost every time, there is more than one product that will meet our needs.
Knowing this, I’ll be comparing products based on an “out-of-the-box” basis. This means, that no addons or plugins or anything will be taken into account for comparison purposes (we have to draw the line somewhere).
My chosen criteria
I won’t be benchmarking product’s speed against each others, if that’s what you’re looking for then stop reading here. Even though it might be a good measure of how good a product is (in some cases) I feel that in this particular case, where we don’t really have a specific problem to solve, those numbers would not bring anything of value to the table.
What I will be taking into account is the amount of features (search related features actually) that come with a product out of the box. I fell that this approach could give a basic understanding of which product stands over others for any generic given problem. In the end, based solely on this information, my resolution can be wrong (or misleading) for some specific cases, but then again, that’s not what I’m trying to solve, since in order to tackle specific scenarios this article would have to be extended into a small book.
Another point of notice, is the fact that all of the presented solutions (as we’ll see in a minute), are long term projects, with active support and under constant development (and improvement).
Now that we know what we’ll be comparing and what to look for on our projects, lets take a look at them:
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It has support for indexing several industry standards, like Word documents, Excel, PowerPoint, PDF, HTML files and many more.
High performance indexing
- Over 95GB/hour on modern hardware
- Incremental indexing as fast as batch indexing
- Index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
- Ranked searching — best results are returned first
- Many powerful query types:
- Phrase queries: allows for more than one word to be used on the search.
- Use of wildcards: The API supports the use of “*” and “?” as wildcards on queries.
- Proximity queries: Use of ~ to find look-a-like words.
- Range queries: Used for range of numbers (1 to 10), sorted words (Amanda to Sabrina), date range and more.
- Fielded searching (e.g., title, author, contents)
- Sorting by any field
- Multiple-index searching with merged results
- Allows simultaneous update and searching
- Available as Open Source software under the ApacheLicense which lets you use Lucene in both commercial and Open Source programs
- 100%-pure Java
- Implementations in other programming languages available that are index-compatible
- PHP: ZendSearch
- Ruby: Ferret
- C++: CLucene
- Many others!
Who uses it?
This is a widely used API, and some of the major software companies that use it are:
- Eclipse: The Eclipse IDE uses Lucene for searching it’s documentation.
- CNETReviews – Product Listing pages with Filter drill downs.
- IBMOmniFindPersonale–mailSearch – A semantic search engine for your e-mail.
- Others like Linkedin, AOL, etc (full list here)
Xapian, like Lucene, is another open source, widely used full text search API, but unlike the first one, this one is written in C++.
Some basic features:
- Highlyportable – runs on Linux, Mac OS X, FreeBSD, NetBSD, OpenBSD, Solaris, HP-UX, Tru64, IRIX, and probably other Unix platforms; as well as Microsoft Windows and OS/2.
- Written in C++.
- Ranked probabilistic search – important words get more weight than unimportant words, so the most relevant documents are more likely to come near the top of the results list.
- Relevance feedback – given one or more documents, Xapian can suggest the most relevant index terms to expand a query, suggest related documents, categorize documents, etc.
- Phrase and proximity searching – users can search for words occurring in an exact phrase or within a specified number of words, either in a specified order, or in any order.
- Full range of structured boolean search operators (“stock NOT market”, etc). The results of the boolean search are ranked by the probabilistic weights. Boolean filters can also be applied to restrict a probabilistic search.
- Supports stemming of search terms (e.g. a search for “football” would match documents which mention “footballs” or “footballer”). This helps to find relevant documents which might otherwise be missed. Stemmers are currently included for many languages (not only English), like Danish, Dutch, English, Finnish, French, German and more.
- Wildcard search is supported (e.g. “xap*”).
- Synonyms are supported, both explicitly (e.g. “~cash”) and as an automatic form of query expansion.
- Xapian can suggest spelling corrections for user supplied queries. This is based on words which occur in the data being indexed, so works even for words which wouldn’t be found in a dictionary (e.g. “xapian” would be suggested as a correction for “xapain”).
Xapian supports this added feature, called faceted search. What this basically means, is that search queries can have categories of interest and so can the terms and information indexed. This allows Xapian to narrow your search, to a particular sub-set of categories.
- Supports database files > 2GB – essential for scaling to large document collections.
- Platform independent data formats – you can build a database on one machine and search it on another.
- Allows simultaneous update and searching. New documents become searchable right away.
Who uses it?
- Debian GNU/Linux
- Debian use Xapian for:
- Languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish
- Database size: 3.9 million messages (2009-05-18)
- Database size: over 6700 pages (2009-05-18)
- Database size: over 30000 packages (2009-03-16)
- Xapian is used withOpenBib to search the library’s OPAC (Online Public Access Catalogue)
- Application: Free online community with profiles, chat, boards, blogs, games, and more
Our third item on the list is Sphinx, an open source full text search server. This one was designed for indexing database content. It currently supports MySQL, PostgreSQL, and ODBC-compliant databases as data sources natively.
The integration with RDBMSs (Relational Data Base Management Systems) allows us to use Sphinx as if it were a database itself. As it is with the rest of these solutions, this might be the thing that you need for your application or it can be completely irrelevant.
Some basic features:
Performance and scalability
- Indexing performance. Sphinx indexes up to 10-15 MB of text per second per single CPU core, that is 60+ MB/sec per server (on a dedicated indexing machine).
- Searching performance. According to their statistics, searching through 1,000,000-document (a 1.2GB text collection) runs at around 500 queries/sec on a 2-core desktop machine, with 2GB or RAM.
- Scalability. Biggest known Sphinx cluster indexes almost 5 billion documents, resulting in over 6 TB of data.
- Batch and Real-Time full-text indexes. Two index backends that support both offline index construction and incremental on-the-fly index updates are available.
- Non-text attributes support. An arbitrary number of attributes (product IDs, company names, prices, etc) can be stored in the index and used either just for retrieval (to avoid hitting the DB), or for efficient Sphinx-side search result set post-processing.
- SQL database indexing. Sphinx can directly access and index data stored in MySQL (all storage engines are supported), PostgreSQL, Oracle, Microsoft SQL Server, SQLite, Drizzle, and anything else that supports ODBC.
- Non-SQL storage indexing. Data can also be streamed to batch indexer in a simple XML format called XMLpipe, or inserted directly into an incremental RT index.
- Easy application integration.Sphinx comes with three different APIs:
- SphinxAPI is a native library available for Java, PHP, Python, Perl, C, and other languages.
- SphinxSE, a pluggable storage engine for MySQL, enables huge result sets to be shipped directly to MySQL server for post-processing.
- SphinxQL lets the application query Sphinx using standard MySQL client library and query syntax.
- Advanced full-text searching syntax. It’s querying engine supports arbitrarily complex queries combining boolean operators, phrase, proximity, strict order, and quorum matching, field and position limits, exact keyword form matching, substring searches, etc.
- Rich database-like querying features. Sphinx does not limit you to just keyword searching. On top of full-text search result set, you can compute arbitrary arithmetic expressions, add WHERE conditions, do ORDER BY, GROUP BY, use MIN/MAX/AVG/SUM, aggregates etc. Essentially, full-blown SQL SELECT is supported.
- Relevance ranking. Sphinx claims to have a better-than-average ranking algorithm, analyzing keyword proximity, and ranking closer phrase matches higher, with perfect matches ranked on top. Also, ranking is flexible: you can choose from a number of built-in relevance functions, tweak their weights by using expressions, or develop new ones.
- Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all world’s languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more.
- Distributed searching. Searches can be distributed across multiple machines, enabling horizontal scale-out and high availability.
Some things to notice that are not provided by Sphinx:
- Wildcards: Sphinx (on it’s release version) does not allow the use of wildcards on it’s searches, although it’s latest beta version does.
- Sphinx cannot parse rich text format documents (like Word documents, PDF files, etc).
Who uses it?
- Craigslist: A classifieds page, with more than 20 billion page views per month.
- DailyMotion: Widely used video site.
- NetLog: Social network with more than 77million users.
Solr is an open source enterprise search platform from the Apache Lucene project. This means that it uses (and extends) the Lucene full-text search API we talked before, for all the full-text searches. This particular feature is of note, since this allows Solr to improve whenever the Lucene project integrates an improvement into its product. This is especially true as of May 2011, since both projects decided to merge.
It also differentiates from the other solutions presented here, because of it’s unique REST-like interface. You can put documents in it (called “indexing”) via XML, JSON or binary over HTTP and then you query them via HTTP GET and receive XML, JSON, or binary results.
This makes Solr completely language independent, allowing any developer using an HTTP capable programming language to interact with this platform.
Some basic features:
- Advanced Full-Text Search Capabilities
- External file-based configuration of stopword lists, synonym lists, and protected word lists
- Many additional text analysis components including word splitting, regex and sounds-like filters
- Sort by any number of fields, and by complex functions of numeric fields
- Spelling suggestions for user queries
- “More Like This” suggestions for given document
- Configurable Query Result, Filter, and Document cache instances
- Pluggable Cache implementations, including a lock free, high concurrency implementation
- Cache warming in background
- When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
- Auto warming in background
- The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
- Fast/small filter implementation
- Full logging control
- Text analysis debugger, showing result of every stage in an analyzer
- Web Query Interface w/ debugging output
- parsed query output
- Lucene explain() document score detailing
- explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher.
Who uses It?
- http://www.whitehouse.gov/ – Uses Solr via Drupal for site search w/highlighting & faceting (More details:1,2)a
- AOLis using Solr to power its channels
- digg uses Solr for search
- NASAPlanetaryDataSystem (PDS) uses Solr to search for dataset, mission, instrument, target, and host information
- reddit uses Solr for search
- And many more (http://wiki.apache.org/solr/PublicServers)
- This is not always the case, some of the latest engines implement incremental indexing (or some other strategy) in order to solve this problem.
- Words that carry too little meaning or are too common to be of use during the search process.
- The process of relating words that are written different but have the same stem, allowing for cases like “drive” being equal to “drove” or “driven”.