03 Dec 2008 20:58
TAGS: dev fulltext lucene php search wikidot zend
As some of you have already noticed, we now use Google Custom Search for searching the whole of Wikidot. The reason is simple.
We now host 2 million pages, and the full-text search engine we had been using was not fast enough to satisfy a regular user. The average search time (across all wikis) was about 30 seconds.
Google searches the whole of Wikidot in less than 1 second. The downsides of using the Google engine are:
- relying on an external service, which is a matter of both prestige and dependence
- ads displayed in the search results (for those who don't use AdBlock)
- pages get indexed only after a significant delay
- only public wikis can be indexed
One important thing is that Google indexes all the content of a site. This includes the Wikidot.com footer, menus, the wiki header, the real content and the tags. All of this is treated in a way unknown to us, so we have no real influence on how Google weighs the different portions of a page.
This leads to the conclusion that we need a search engine of our own.
We would like:
- to treat tags as more important than the regular content
- not to search the static Wikidot.com elements (like the footer on every page)
- to allow searching all wikis available to a given user:
  - all public wikis
  - all private wikis that the user is a member of
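To make the wish list above a bit more concrete, here is a minimal sketch in Python (used only for illustration, it is not Wikidot code). It assumes a made-up `Page` record, an arbitrary `TAG_WEIGHT` of 5 and a naive word-count score. Real engines rank quite differently, so treat this only as an illustration of "tags count more" and "search only the wikis a user may see".

```python
from dataclasses import dataclass, field

# Assumed weights: a tag match counts five times as much as a content match.
TAG_WEIGHT = 5.0
CONTENT_WEIGHT = 1.0


@dataclass
class Page:
    """Hypothetical page record holding only real content and tags (no footer or menus)."""
    wiki: str
    title: str
    content: str
    tags: list = field(default_factory=list)


def searchable_wikis(public_wikis, private_memberships, user):
    """All public wikis plus the private wikis the user is a member of."""
    return set(public_wikis) | set(private_memberships.get(user, ()))


def score(page, term):
    """Naive score: weighted count of exact tag and word matches."""
    term = term.lower()
    tag_hits = sum(1 for tag in page.tags if tag.lower() == term)
    content_hits = page.content.lower().split().count(term)
    return TAG_WEIGHT * tag_hits + CONTENT_WEIGHT * content_hits


def search(pages, term, allowed_wikis):
    """Return matching pages from the allowed wikis, best score first."""
    scored = [(score(p, term), p) for p in pages if p.wiki in allowed_wikis]
    hits = [(s, p) for s, p in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in hits]
```

With this scoring, a page tagged with the search term outranks a page that only mentions the term once in its content, and pages from private wikis show up only for their members.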
Coming to the technical details, one could say we just need a generic full-text search engine. We can use the one available in our storage system or one of the dedicated search-only engines.
Tsearch, the full-text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching within a single wiki. It is nicely integrated and works well. But with over 20,000 wikis to search (counting only the non-spam public ones), it is not efficient enough.
Lucene, one of the most popular search engine libraries for Java, is one of the possible choices. It works as follows:
- the application puts some documents into the search index
  - a document is a webpage in our situation
  - the index is a datastore to be used when searching
- the user queries the index
- a set of documents is returned in order of relevance
  - the documents returned are more or less the same as the documents put into the index before
This mechanism requires the application to populate the index:
- updating the index every now and then
- updating a document on some change, like a page edit
It also requires us to define some functions that deal with the search results:
- the original webpage is usually not stored in the index, but only tokenized, so that it can be found when searching for any word in the document
  - this makes the index smaller and faster
  - this means we need to store the ID of the webpage as an additional field, so that the full page can be retrieved from the database by that ID
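The idea of keeping only tokens plus a page ID in the index can be sketched roughly like this. It is a toy inverted index in Python, not Lucene and not its API: the `TinyIndex` class and its method names are made up for this example, and it does no relevance ranking at all.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: stores tokens and page IDs, never the page text."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> IDs of pages containing it
        self._tokens_by_page = {}          # page ID -> tokens, needed for updates

    @staticmethod
    def tokenize(text):
        """Split text into lowercase alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, page_id, text):
        """Index a page; re-adding an existing ID replaces it (e.g. after an edit)."""
        self.remove(page_id)
        tokens = set(self.tokenize(text))
        for token in tokens:
            self._postings[token].add(page_id)
        self._tokens_by_page[page_id] = tokens

    def remove(self, page_id):
        """Drop a page from the index if it is present."""
        for token in self._tokens_by_page.pop(page_id, ()):
            self._postings[token].discard(page_id)

    def search(self, query):
        """Return the IDs of pages that contain every token of the query."""
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```

The search returns only IDs, so the application still has to fetch the full pages from its own storage (the database, in our case) before showing the results.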
Nutch would be a different, more Googlish approach to the search issue. Nutch mainly indexes HTML files (given their URLs) and crawls through the links. This has both advantages and disadvantages.
The main advantage of using Nutch is that as a search result we get a formatted HTML document:
- with links to the items found
- with the context of the search phrase quoted
- with the words of the search phrase highlighted in some way
This is very similar to what we get when searching for a phrase with Google.
What I don't like about Nutch is the quite big overhead of populating the index. A page must be compiled by the server and HTML must be produced. Then the same HTML must be parsed by the search engine to extract the important data. A lot of information is generated by the server and then thrown away by the search engine.
Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.
There is the OpenSearch project, which aims to make the Nutch results more interchangeable (by exporting them as RSS feeds). Using it, a PHP application can simply ask an HTTP service for results and get an RSS feed to parse and present to the user.
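Parsing such a feed on the application side is not much work. The sketch below uses Python and its standard XML parser just to show the idea. The sample feed is entirely made up, and a real Nutch or OpenSearch response will contain more (and possibly differently named) elements, so this is an assumption about the general RSS shape only.

```python
import xml.etree.ElementTree as ET

# Made-up sample of an RSS 2.0 result feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Search results (made-up sample)</title>
    <item>
      <title>First page</title>
      <link>http://example.com/first-page</link>
      <description>Some context around the search phrase.</description>
    </item>
    <item>
      <title>Second page</title>
      <link>http://example.com/second-page</link>
      <description>More context around the search phrase.</description>
    </item>
  </channel>
</rss>"""


def parse_results(feed_xml):
    """Turn an RSS feed string into a list of {title, link, description} dicts."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]
```

In a real setup the feed string would come from an HTTP request to the search service instead of a constant.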
There is also quite a nice thing around: Zend_Search_Lucene. It is a search engine written entirely in PHP and is part of the Zend Framework. The internal format of the search index files is compatible with Lucene, which is where the name of the package comes from. The query language is also the same (or very similar).
One would expect a PHP implementation to be really slow when searching really big sets of data, but after some testing we get the search results for almost any query in about 1 second, searching almost the whole of Wikidot.
I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. Searching can also be easily distributed across many machines. For example, you can have 4 search machines, each handling 1/4 of the search queries. This does not reduce the time of a single search, but it avoids running many searches against one index at once.
There are some options to consider when distributing the search queries among different machines. We can select the machine to perform the search at random, in turn (round-robin) or by a search hash. The search hash approach computes an MD5 sum (or other hash) of the query, treats it as an integer, takes the remainder of dividing it by the number of machines and assigns the search to the machine with that number. This means the same query will always be performed on the same machine (where it can then be better cached or optimized).
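The search hash variant fits in a few lines. Here is a minimal sketch in Python (for illustration only), where the function name `machine_for_query` and the zero-based machine numbering are my assumptions:

```python
import hashlib


def machine_for_query(query, machine_count):
    """Pick a machine number (0 .. machine_count - 1) from the MD5 hash of the query."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    # Treat the hexadecimal digest as one big integer and take the remainder.
    return int(digest, 16) % machine_count
```

Note that queries differing only in letter case or whitespace hash differently, so normalizing the query before hashing would probably help caching. Also, changing the number of machines changes the assignment of most queries.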
The Zend implementation of Lucene is also really simple to understand and use, so it seems like a good starting point to me. Testing it on the whole public part of Wikidot, I got an index of about 500 MB. Adding a single page to an index of this size takes about 2 seconds. Searching takes about 1 second.
When I asked a friend which full-text search engines he would recommend, he pointed out Sphinx, a standalone application for this purpose. It is not very popular software (it hasn't found its way into the Ubuntu repository, for example), but it seems very interesting.
Sphinx can be fed with XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, and can be accessed through its API and client libraries for many languages.
It seems to be somewhat similar to Lucene, but implemented in a traditional compiled language (C++) rather than Java.
There are probably some other solutions worth trying, but I think the most appropriate one for now is the Zend one, as it is the easiest to adapt. We can optionally add some caching mechanisms and query distribution. We also need more testing of situations that may come up (like the need to perform 100 queries simultaneously).
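A test of the "100 queries at once" situation could look roughly like the Python sketch below. The `search` argument is a placeholder for whatever function actually sends a query to the search engine (for example over HTTP), and the helper names are made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(search, query):
    """Run one search and return how long it took, in seconds."""
    start = time.perf_counter()
    search(query)
    return time.perf_counter() - start


def run_load_test(search, queries, concurrency=100):
    """Run the queries with up to `concurrency` of them in flight at once."""
    queries = list(queries)
    if not queries:
        return {"count": 0, "max": 0.0, "mean": 0.0}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda q: timed(search, q), queries))
    return {
        "count": len(durations),
        "max": max(durations),
        "mean": sum(durations) / len(durations),
    }
```

Threads are enough here as long as the search function mostly waits for a remote service. The numbers measure the client's view of each query, including the time spent waiting for the network.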