03 Dec 2008 20:58
TAGS: dev fulltext lucene php search wikidot zend
As some of you have already noticed, we now use Google Custom Search for searching the whole of Wikidot. The reason is simple.
We now host 2 million pages, and the full text search engine we had been using was not fast enough to satisfy a regular user. The average search time (across all wikis) was about 30 seconds.
Google searches the whole of Wikidot in less than 1 second. The downsides of using the Google engine are:
- relying on an external service (a matter of prestige and dependence)
- displaying ads in search results (for those who don't use AdBlock)
- pages get indexed only after a significant delay
- only public wikis can be indexed
One important point is that Google indexes all the content of a page. This includes the Wikidot.com footer, menus, the wiki header, the real content and the tags. All of this is weighted in a way unknown to us, so we have no real influence on how Google treats different portions of a page.
This leads to the conclusion that we need our own search engine.
We would like:
- to treat tags as more important than the regular content
- not to search the static Wikidot.com elements (like the footer on every page)
- to allow searching all wikis available to a given user:
  - all public wikis
  - all private wikis that the user is a member of
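To make these requirements concrete, here is a minimal Python sketch. It is not Wikidot code: the Wiki class, the function names and the tag_weight value of 3.0 are all assumptions made up for illustration. It shows one way to pick the wikis a user may search and to let a tag match count for more than a content match.

```python
from dataclasses import dataclass, field


@dataclass
class Wiki:
    # Hypothetical model of a wiki, for illustration only.
    name: str
    private: bool = False
    members: set = field(default_factory=set)


def searchable_wikis(wikis, user):
    """Return names of all public wikis plus the private wikis the user belongs to."""
    return [w.name for w in wikis if not w.private or user in w.members]


def score(query_word, content_words, tags, tag_weight=3.0):
    """Count content matches once and tag matches tag_weight times.

    tag_weight is an arbitrary illustrative value, not a tuned one.
    """
    content_hits = sum(1 for w in content_words if w == query_word)
    tag_hits = sum(1 for t in tags if t == query_word)
    return content_hits + tag_weight * tag_hits
```

Static elements such as the footer never reach score() in this sketch, simply because only the page content and tags are passed in.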
Coming to the technical details, one could say we just need a generic full text search engine. We can use the one available in our storage system or one of the dedicated search-only engines.
Tsearch
Tsearch, the full text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching a single wiki. It is nicely integrated and works well. But when there are over 20,000 wikis to search (counting only non-spam public ones), it is not fast enough.
Lucene
Lucene, one of the most popular search engines for Java, is one of the possible choices. It works as follows:
- the application adds some documents to the search index
- a document is a web page in our case
- the index is a datastore used when searching
- the user queries the index
- a set of documents is returned in order of relevance
- the documents returned are more or less the same as the documents added to the index before
This mechanism requires the application to populate the index:
- updating the index every now and then
- updating a document on some change, like a page edit
and requires us to define some functions that deal with the search results:
- the original web page is usually not stored in the index, only tokenized, so that it can be found when searching for any word in the document
- this makes the index smaller and faster
- this means we need to store an additional ID of the web page, to be able to retrieve the full result from the database based on the stored ID
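The mechanism described above can be sketched in a few lines of Python. This is a toy inverted index written for illustration, not Lucene and not its API: the TinyIndex class, its method names and the count-based ranking are assumptions. It only demonstrates the idea that the index keeps tokens and page IDs, while the full page stays in the database.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the page IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(dict)  # token -> {page_id: occurrence count}

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def remove(self, page_id):
        for pages in self.postings.values():
            pages.pop(page_id, None)

    def add(self, page_id, text):
        # Re-adding a page (for example after an edit) replaces its old entry.
        self.remove(page_id)
        for token in self.tokenize(text):
            pages = self.postings[token]
            pages[page_id] = pages.get(page_id, 0) + 1

    def search(self, query):
        # Rank by total occurrences of the query tokens; ties go to the lower ID.
        scores = defaultdict(int)
        for token in self.tokenize(query):
            for page_id, count in self.postings.get(token, {}).items():
                scores[page_id] += count
        return sorted(scores, key=lambda pid: (-scores[pid], pid))
```

search() returns only page IDs, so the application would still have to load titles and content from its own database using those IDs.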
Nutch
Nutch would be a different, more Google-like approach to the search problem. Nutch mainly indexes HTML files (from given URLs) and crawls through the links. This has both advantages and disadvantages.
The main advantage of using Nutch is that as a search result we get a formatted HTML document:
- with links to items found
- with context of the search phrase quoted
- the search phrase words outlined in some way
This is very similar to what we get searching for some phrase with Google.
What I don't like about Nutch is the considerable overhead of populating the index. A page must be compiled by the server and HTML must be produced. Then the same HTML must be parsed by the search engine to extract the important data. A lot of information is generated by the server and then discarded by the search engine.
Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.
There is the OpenSearch project, which aims to make Nutch results more interchangeable (by exporting them as RSS feeds). Using it, a PHP application can safely ask an HTTP service for results and get an RSS feed to parse and present to the user.
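As a rough sketch of the "get an RSS feed and parse it" step, here is how the parsing side could look, shown in Python rather than PHP. The feed below is a made-up sample, not real Nutch output, and the parse_results name and the choice of fields are assumptions. In practice the XML would come from the HTTP service instead of a constant.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed for illustration only; real results would come over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search results (made-up sample)</title>
    <item>
      <title>Example page</title>
      <link>http://example.com/page</link>
      <description>Snippet with the search phrase.</description>
    </item>
  </channel>
</rss>"""


def parse_results(feed_xml):
    """Turn the <item> elements of an RSS 2.0 feed into plain dictionaries."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "snippet": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]
```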
Zend_Search_Lucene
There is also a quite nice option: Zend_Search_Lucene. It is a search engine written entirely in PHP, part of the Zend Framework. The internal format of the search index files is compatible with Lucene, which is where the name of the package comes from. The query language is also the same (or very similar).
One would expect a PHP implementation to be really slow when searching really big data sets, but in our tests we got the search results for almost any query in about 1 second, searching almost the whole of Wikidot.
I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. Searching can also be easily spread across many machines. For example, you can have 4 search machines, each carrying out 1/4 of the search queries. This does not reduce the time of a single search, but it keeps any one machine from handling too many queries at once.
There are some options to consider when distributing the search queries among machines. We can select the machine to perform the search at random, in turn (round-robin) or by search hash. With a search hash, we would compute an MD5 sum (or another hash) of the query, treat the hash as an integer, take the remainder of dividing it by the number of machines, and assign the search to the machine with that number. This means the same query will always be performed on the same machine (where it can be cached or optimized better).
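The search hash option could look roughly like this in Python. It is a minimal sketch: the function name machine_for_query is an assumption, and machines are assumed to be numbered from 0. MD5 is used here only to spread queries evenly, not for security.

```python
import hashlib


def machine_for_query(query, machine_count):
    """Map a query to a machine number in the range 0..machine_count-1.

    The same query string always yields the same machine number.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    # Treat the hexadecimal digest as one big integer, then take the remainder.
    return int(digest, 16) % machine_count
```

Note that "Wikidot" and "wikidot" hash differently, so normalizing the query (for example lowercasing it) before hashing would probably help per-machine caching. Changing the number of machines also reassigns most queries to different machines.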
The Zend implementation of Lucene is also really simple to understand and use, so it seems a good start to me. Testing it on the whole public part of Wikidot, I got an index of about 500 MB. Adding a single page to an index of this size takes about 2 seconds. Searching takes about 1 second.
Sphinx
When I asked a friend which full text search engines he recommends, he pointed out Sphinx, a standalone application for this purpose. It is not very popular software, as it hasn't found its way into the Ubuntu repository, for example, but it seems very interesting.
Sphinx can be fed with XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, or can be accessed via its API and libraries for many languages.
It seems to be somewhat similar to Lucene, but implemented in C++ rather than Java.
The choice
There are probably some other solutions worth trying, but I think the most appropriate for now is the Zend one, as it is the easiest to adapt. We can optionally use some caching mechanisms and query distribution. We also need more testing of situations that may occur (like the need to perform 100 queries simultaneously).

I think using Google is the best bet.
In fact, I would highly recommend using the Google Search API (or whatever it is) and completely integrating it with Wikidot for searching all wikis and even individual ones.
My two cents. :)
— hartnell
"Early in life I learned that I was a firestarter. When I learned that I could light fires of passion I became a pyromaniac. My greatest desire is to set the world on fire." — Shawn Hartnell
without any adware blocks - as an application on your server ?
Have I mis-understood this from google ?
Service is my success. My webtips:www.blender.org (Open source), Wikidot-Handbook.
You can ask questions and contribute in the German-language » user community for Wikidot users or
in the German » Wikidot Handbook.
I think the current per-wiki search engine is amazing, and superior to Google's. I say this because it searches content that is current, whereas Google can only search content that it has seen (cached). If you update a page on the wiki, you need to wait for Google to come to your wiki and update its cache before search results show the up-to-date content.
But when it comes to searching all Wikidot sites, showing out-of-date content doesn't really matter. If you want to find a specific thing on a particular site, just go to that site and search it. If you want to search more generally across Wikidot, then Google's out-of-date cache will do perfectly.
λ James Kanjo
Blog | Wikidot Expert | λ and Proud
Web Developer | HTML | CSS | JavaScript
We have to mix a few things:
With the new module it could be possible to specify which sites to search, i.e. one could insert a module that searches for a given phrase in 5 different wikis (probably content-related).
Just as easy as for example (THIS IS JUST A PROPOSAL. DO NOT TRY IT):
And you have a search box that searches for you in 3 wikis at once (and the sites are indexed immediately after each change), so you get very fresh results.
Piotr Gabryjeluk
visit my blog
I was doing similar research on external full text search engines and found this article (in Spanish, but you can translate it) http://www.alfonsojimenez.com/2007/08/30-benchmark-lucene-en-php-vs-lucene-en-java which compares Java Lucene with the PHP Zend one. According to it, the PHP one is considerably slower.
I'm leaning towards Sphinx myself. It has a PHP API, and if it can power Craigslist, well…
Java Lucene is a few times faster than the PHP version. It's like 1.5 s (PHP) vs. 0.5 s (Java) when searching for some simple phrase and like 10 s (PHP) vs. 1.5 s (Java) when searching for something with ~ (which means "similar" in the Lucene query language). But this was tested on about a million documents. Normally, when you don't have so many of them, PHP is really OK, as the difference between 0.2 s and 0.05 s is almost unnoticeable.
In Wikidot we use PHP to index things (because we have a nice database-object mapping in PHP) and Java to search (but then PHP to interpret the search results and ask the database for more information on the items found).
Piotr Gabryjeluk
visit my blog
Also, the benchmark doesn't have much to do with real life, as normally you don't search for the same thing 100 times in a row; usually a phrase is searched once. My benchmark tests searching for a phrase once.
This includes the delays of loading all the Java/PHP classes and so on. I run the command a few times beforehand to let the kernel cache the files, and then I measure the times of Java and PHP searching for the same phrase.
Java is then about 3 times faster than PHP on a quad-core processor with a 500 MB Lucene index (hopefully, with over 1 GB of free RAM, all of it sitting in the system cache, in RAM).
Piotr Gabryjeluk
visit my blog