18 Dec 2008 23:22
TAGS: dev java lucene php search wikidot
Lucene-based search, a brand new feature, has been making its way into the Wikidot code for quite a long time.
After all that time spent developing it and dealing with performance problems (searching the whole of Wikidot in 3 seconds is too long!), it's time to test this thing!
This is all about the Search All Sites module for Wikidot Open Source. This entry is mainly for those of you who run your own Wikidot services. More info on getting the Wikidot software running can be found on Ed's site.
If you install Wikidot Open Source from the current version, you can test the new Search All Sites module right away. Just navigate to the /search:all page of your main wiki. Example: if your wiki farm runs on the domain mydomain.com and your main wiki is www.mydomain.com, navigate to http://www.mydomain.com/search:all.
Updating an already installed version to the new search engine is a bit trickier, because you need to pre-populate the search index (the actual file the search engine looks through when looking for the term the user entered).
You can try the following commands:
- obtain root privileges
- navigate to your Wikidot directory (it is /var/www/wikidot by default), update the code and run the lucene_bootstrap.php script as your lighttpd user
cd /var/www/wikidot
svn update
cd tests
sudo -u www-data php lucene_bootstrap.php
This command adds every page to the index (normally located at /var/www/wikidot/tmp/lucene_index). Once indexed, a page can be searched (if the site containing the page is public or you're a member of the site). The command prints a dot for every 10 indexed items (an item is a page or a forum/comments thread).
If this runs smoothly (i.e. with no errors; a Segmentation fault at the end is OK, but a memory exhausted error is not), all your sites are indexed and ready to be searched.
If it fails, you can increase the memory_limit setting in the corresponding php.ini file and re-run the command. There is no harm in running this command more than once, as indexing a page always deletes the page from the index before adding it again.
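To see why re-running the bootstrap is safe, here is a minimal Python sketch of the delete-then-add idea. It is not the actual lucene_bootstrap.php code: TinyIndex and bootstrap are hypothetical names, and the "index" is just a list that, like Lucene, would happily store the same item twice if nothing deleted it first.

```python
import io


class TinyIndex:
    """Toy stand-in for the search index (hypothetical, for illustration only)."""

    def __init__(self):
        # A list of (item_id, text) pairs; add() never overwrites anything.
        self.docs = []

    def delete(self, item_id):
        # Deleting an item that is not indexed yet is a harmless no-op.
        self.docs = [doc for doc in self.docs if doc[0] != item_id]

    def add(self, item_id, text):
        self.docs.append((item_id, text))

    def index(self, item_id, text):
        # Delete first, then add: re-indexing never leaves duplicates behind.
        self.delete(item_id)
        self.add(item_id, text)


def bootstrap(index, items, out):
    """Index every (item_id, text) pair, writing one dot per 10 items."""
    count = 0
    for item_id, text in items:
        index.index(item_id, text)
        count += 1
        if count % 10 == 0:
            out.write(".")
    return count


items = [("page-%d" % n, "content %d" % n) for n in range(25)]
index = TinyIndex()
progress = io.StringIO()
bootstrap(index, items, progress)
bootstrap(index, items, progress)  # second run, e.g. after raising memory_limit
print(len(index.docs))  # still one entry per item
```

Running the bootstrap twice leaves exactly one entry per item, which is the property that makes a re-run after a memory error harmless.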
Then just go to the /search:all page of your main wiki and search for some content.
Also you need to update your crontab file. Add:
* * * * * www-data /var/www/wikidot/bin/job.sh UpdateLuceneIndexJob
to your /etc/crontab (assuming you have Wikidot in /var/www/wikidot/). This adds a job that runs every minute and indexes the pages and threads that were queued for indexing when they were saved or when a site's public/private state changed.
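The internals of UpdateLuceneIndexJob are not shown here, but the queue-and-drain pattern described above can be sketched in a few lines of Python. Everything in this sketch (IndexQueue, run_update_job, the load_text callback, a plain dict as the index) is a hypothetical name chosen for illustration.

```python
from collections import deque


class IndexQueue:
    """Hypothetical queue of items waiting to be (re)indexed."""

    def __init__(self):
        self._pending = deque()
        self._waiting = set()

    def enqueue(self, item_id):
        # Called when an item is saved; an item already waiting is not queued twice.
        if item_id not in self._waiting:
            self._waiting.add(item_id)
            self._pending.append(item_id)

    def drain(self):
        # Hand back everything queued so far and start over with an empty queue.
        items = list(self._pending)
        self._pending.clear()
        self._waiting.clear()
        return items


def run_update_job(queue, load_text, index):
    """One cron tick: re-index every queued item and return how many were done."""
    items = queue.drain()
    for item_id in items:
        index[item_id] = load_text(item_id)
    return len(items)


queue = IndexQueue()
queue.enqueue("page:1")
queue.enqueue("page:1")    # saved twice within the same minute
queue.enqueue("thread:7")
index = {}
print(run_update_job(queue, lambda item_id: "text of " + item_id, index))
```

The point of the design is that saving a page stays cheap (it only adds an entry to the queue), while the slower indexing work happens once a minute in the background.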
- First of all, the new search applies only to Search All Sites, i.e. Search This Site still works the old way.
- Search uses titles and tags intelligently
  - pages with the exact search phrase in the title are placed higher in the result list
  - pages with tags matching the search phrase are quite high in the result list
  - pages with a title matching the search phrase are quite high in the result list
  - pages with content matching the search phrase are rather low in the result list
  - pages with titles and tags matching only parts of the search phrase can be higher in the result list than pages whose content matches even the exact phrase
  - this all means: tags and titles are more important than content for the search engine
- You can narrow your search to selected wikis only
  - append site:site1,site2,site3 (no spaces between the site names) to your search query. Example: searching for "gabrys site:www,community" looks for gabrys in the titles, tags and contents of pages and threads inside the sites www.yourdomain.com and community.yourdomain.com (supposing your Wikidot installation runs on yourdomain.com)
- The search includes public sites plus the sites you are a member of. Also, results from your own sites are generally treated as more relevant by the search engine (i.e. they appear higher than results from other sites)
- The search results for a given phrase and a given user are cached (if memcached is used) for a few minutes. This makes searching even smoother (no need to search the index again when the user only switches the result page from 2 to 3, for example)
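The rules above can be put together in a small Python sketch: parse the site: modifier, drop private sites the user does not belong to, and rank title and tag matches above content matches. This is only an illustration of the behaviour described, not the real Lucene query; the Page class, the function names and especially the weight values are assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Illustrative weights only; the real boost values are not given in this post.
EXACT_TITLE, TITLE, TAG, CONTENT, OWN_SITE = 8.0, 4.0, 4.0, 1.0, 1.5


@dataclass
class Page:
    site: str
    title: str
    content: str
    tags: list = field(default_factory=list)
    public: bool = True


def parse_query(query):
    """Split 'gabrys site:www,community' into (['gabrys'], {'www', 'community'})."""
    terms, sites = [], set()
    for token in query.split():
        if token.startswith("site:"):
            sites.update(s for s in token[len("site:"):].split(",") if s)
        else:
            terms.append(token.lower())
    return terms, sites


def score(page, terms, member_of):
    """Weight title and tag matches above content matches."""
    phrase = " ".join(terms)
    title = page.title.lower()
    tags = [tag.lower() for tag in page.tags]
    words = page.content.lower().split()
    total = 0.0
    if phrase and phrase in title:
        total += EXACT_TITLE
    for term in terms:
        if term in title.split():
            total += TITLE
        if term in tags:
            total += TAG
        if term in words:
            total += CONTENT
    if page.site in member_of:
        total *= OWN_SITE  # results from the user's own sites rank higher
    return total


def search(pages, query, member_of=frozenset()):
    terms, sites = parse_query(query)
    hits = []
    for page in pages:
        if sites and page.site not in sites:
            continue  # outside the site: filter
        if not page.public and page.site not in member_of:
            continue  # private site the user is not a member of
        points = score(page, terms, member_of)
        if points > 0:
            hits.append((points, page))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [page for _, page in hits]
```

For example, search(pages, "gabrys site:www,community") only looks at pages whose site is www or community, and a page titled "Gabrys" ends up above a page that merely mentions gabrys in its content.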
If you don't run and don't want to run your own Wikidot installation, you can try the new features on the following site:
Do you like the way Google highlights the searched words in its cached versions of result pages? Now you can add this feature to your Wikidot installation as well.
Open the conf/wikidot.ini file and append these lines:
[search]
; enables highlighting of search phrase in the resulting documents
highlight = true
This will highlight the words the user searched for using:
- Google Search
- Yahoo Search
- Your Wikidot installation search:all page
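How the highlighting is wired into Wikidot is not covered here, but the general technique is easy to sketch: read the search phrase from the referring URL and wrap the matching words in the page. The Python sketch below is an assumption-laden illustration, not Wikidot's code. It relies on the usual query parameters (q for Google, p for Yahoo); the function names and the highlight CSS class are made up.

```python
import html
import re
from urllib.parse import parse_qs, urlparse


def terms_from_referrer(referrer):
    """Return the words searched for, or [] if the referrer is not recognised."""
    parts = urlparse(referrer)
    host = parts.netloc.lower()
    if "google." in host:
        param = "q"   # Google passes the phrase as ?q=...
    elif "yahoo." in host:
        param = "p"   # Yahoo passes the phrase as ?p=...
    else:
        return []
    values = parse_qs(parts.query).get(param, [])
    return values[0].split() if values else []


def highlight(text, terms):
    """HTML-escape text and wrap each whole-word match in a <span>."""
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, re.split alternates: non-match, match, non-match, ...
    pieces = pattern.split(text)
    out = []
    for position, piece in enumerate(pieces):
        if position % 2:
            out.append('<span class="highlight">%s</span>' % html.escape(piece))
        else:
            out.append(html.escape(piece))
    return "".join(out)


terms = terms_from_referrer("http://www.google.com/search?q=wikidot+lucene")
print(highlight("Lucene search in Wikidot", terms))
```

Splitting the raw text first and escaping each piece afterwards keeps the highlighter from touching HTML entities produced by the escaping step.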
Need more performance, or memory limit exhausted?
We experienced rather low performance when searching through the 2 million pages and threads of Wikidot.com: the search results were generated in about 3 seconds. This was not good enough for us, so we managed to speed things up by using the native Java Lucene implementation for searching the index. This works because the PHP Lucene implementation we use is compatible with the Java one, which means we can index pages with PHP and search with Java. And we do! If you want to do this too (because you are experiencing low search performance or getting memory exhausted error messages), just add the following lines to your conf/wikidot.ini file:
[search]
; enables the use of Java for searching
use_java = true
- if you already have a [search] section in the conf/wikidot.ini file, just add the use_java = true line to that section
- enabling Java for searching requires you to install a Java executable for your system. You should know how to do this (try sudo aptitude install openjdk-6-jre).
- you don't need any Java libraries, as we have already bundled everything needed in the .jar file. The Java source and the Ant build script are normally located in the /var/www/wikidot/java directory (assuming you installed Wikidot in /var/www/wikidot).
Once we are sure the search is stable and gives relevant results, we'll introduce it to the Wikidot.com service. I calculated that indexing all the sites would take about 3 days! But searching takes less than 1 second (using the Java program).
I'm looking forward to your comments on the features, especially if you've tried them yourself!
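To put the 3 days in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the estimate covers the roughly 2 million pages and threads mentioned earlier, which is my reading of the numbers rather than a measured figure.

```python
ITEMS = 2_000_000      # pages and threads on Wikidot.com (figure from this post)
INDEXING_DAYS = 3      # estimated time to index everything (figure from this post)

seconds = INDEXING_DAYS * 24 * 60 * 60
per_item = seconds / ITEMS          # average time spent on one item
items_per_second = ITEMS / seconds  # average indexing throughput

print("%.2f s per item, about %.1f items per second" % (per_item, items_per_second))
```

Under that assumption the indexer spends a bit over a tenth of a second on each item.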