New Search For Wikidot
tags: dev fulltext lucene php search wikidot zend
03 Dec 2008 20:58
As some of you have already noticed, we now use Google Custom Search for searching the whole of Wikidot. The reason is quite simple.
We now host 2 million pages, and the full-text search engine we had been using was not fast enough to satisfy a regular user. The average search time (across all wikis) was about 30 seconds.
Google searches the whole of Wikidot in less than 1 second. The downsides of using the Google engine are:
- using an external service (a matter of prestige and dependence)
- ads displayed on the search results (for those who don't use AdBlock)
- pages get indexed only after a significant delay
- only public wikis can be indexed
One important point is that Google indexes all the content of a site. This includes the Wikidot.com footer, menus, the wiki header, the real content and the tags. All of this is weighted in a way unknown to us, so we have no real influence on how Google treats different portions of a page.
This leads to the conclusion that we need a search engine of our own.
We would like:
- to treat tags as more important than the regular content
- not to search the static Wikidot.com elements (like the footer on every page)
- to allow searching all wikis available to a given user:
  - all public wikis
  - all private wikis that the user is a member of
Coming to the technical details, one could say we just need a generic full-text search engine. We can use the one available in our storage system or one of the dedicated search-only engines.
Tsearch
Tsearch, the full-text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching within a single wiki. It is nicely integrated and works well. But when there are over 20,000 wikis to search (counting only the non-spam public ones), it is not efficient enough.
Lucene
Lucene, one of the most popular search engines for Java, is one of the possible choices. It works as follows:
- the application pushes documents to the search index
  - a document is a web page in our case
  - the index is a datastore used when searching
- the user queries the index
- a set of documents is returned in order of relevance
  - the documents returned are more or less the same as the documents pushed to the index before
This mechanism requires the application to populate the index:
- updating the index every now and then
- updating a document on some change, like a page edit
It also requires us to define some functions that deal with search results:
- the original web page is usually not stored in the index, only tokenized, so that it can be found when searching for any word in the document
  - this makes the index smaller and faster
  - this means we need to store an additional ID of the web page, to be able to retrieve the full result from the database based on the stored ID
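The mechanism described above can be illustrated with a toy inverted index. This is only a sketch of the idea, not how Lucene stores its data; the class TinyIndex and its methods are hypothetical names, and the relevance score is a naive sum of term counts.

```php
<?php
// Toy inverted index: an illustration of the mechanism, not Lucene's
// actual data structures. All names here are hypothetical.
final class TinyIndex
{
    /** @var array term => array(page ID => number of occurrences) */
    private $postings = array();

    /** Split text into lowercase word tokens. */
    private function tokenize($text)
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    /** Index a page, or re-index it after an edit. Only tokens and the page ID are kept. */
    public function addDocument($pageId, $text)
    {
        $this->removeDocument($pageId);
        foreach ($this->tokenize($text) as $term) {
            if (!isset($this->postings[$term][$pageId])) {
                $this->postings[$term][$pageId] = 0;
            }
            $this->postings[$term][$pageId]++;
        }
    }

    /** Drop every posting that points at the given page. */
    public function removeDocument($pageId)
    {
        foreach (array_keys($this->postings) as $term) {
            unset($this->postings[$term][$pageId]);
            if (count($this->postings[$term]) === 0) {
                unset($this->postings[$term]);
            }
        }
    }

    /** Return page IDs ordered by a naive relevance score (sum of term counts). */
    public function find($query)
    {
        $scores = array();
        foreach ($this->tokenize($query) as $term) {
            if (!isset($this->postings[$term])) {
                continue;
            }
            foreach ($this->postings[$term] as $pageId => $count) {
                $scores[$pageId] = (isset($scores[$pageId]) ? $scores[$pageId] : 0) + $count;
            }
        }
        arsort($scores);
        return array_keys($scores);
    }
}

// Usage: the index returns only IDs; the full pages would be fetched from the database.
$index = new TinyIndex();
$index->addDocument(1, 'Lucene search engine for Java');
$index->addDocument(2, 'Search search everywhere: a PHP search library');
$pageIds = $index->find('search');
```

Since the index keeps only tokens and page IDs, the application has to look up the returned IDs in its own database to display titles or excerpts.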
Nutch
Nutch would be a different, more Google-like approach to the search problem. Nutch mainly indexes HTML files (given their URLs) and crawls through the links. This has both advantages and disadvantages.
The main advantage of using Nutch is that as a search result we get a formatted HTML document:
- with links to items found
- with context of the search phrase quoted
- with the words of the search phrase highlighted in some way
This is very similar to what we get searching for some phrase with Google.
What I don't like about Nutch is the quite big overhead of populating the index. A page must be compiled by the server and HTML must be produced. Then the same HTML must be parsed by the search engine to extract the important data. A lot of information is generated by the server and then discarded by the search engine.
Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.
There is the OpenSearch project, which aims to make the Nutch results more interchangeable (exporting them as RSS feeds). Using it, a PHP application can simply ask an HTTP service for results and get an RSS feed to parse and present to the user.
Zend_Search_Lucene
There is also quite a nice thing around: Zend_Search_Lucene. It is a search engine written entirely in PHP, and a part of Zend Framework. The internal format of the search index files is compatible with Lucene, which is where the name of the package comes from. The query language is also the same (or very similar).
One would expect a PHP implementation to be really slow when searching really big data sets, but in our tests we got the search results for almost any query in about 1 second, searching almost the whole of Wikidot.
I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. Searching can also be easily spread across many machines. For example, you can have 4 search machines, each handling 1/4 of the search queries. This does not reduce the time of a single search, but it avoids running many searches against one index at once.
There are some options to consider when distributing the search queries among machines. We can select the machine to perform the search at random, in turn (round-robin) or by search hash. The search hash approach computes an MD5 sum (or other hash) of the query, treats it as an integer, takes the remainder of its division by the number of machines, and assigns the search to the machine with that number. This means the same query will always be performed on the same machine (where it can be better cached or optimized).
The Zend implementation of Lucene is also really simple to understand and use, so it seems a good start to me. Testing it on the whole public part of Wikidot, I got an index of about 500 MB. Adding a single page to an index of this size takes about 2 seconds. Searching takes about 1 second.
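The search hash selection could look roughly like the sketch below. The function name pickSearchMachine and the normalization step (trimming and lowercasing the query) are assumptions made for illustration. Only a prefix of the MD5 sum is used, because the full 128-bit value does not fit in a PHP integer.

```php
<?php
// Hypothetical helper: pick a search machine for a query by hashing the query text.
function pickSearchMachine($query, $machineCount)
{
    if ($machineCount < 1) {
        throw new InvalidArgumentException('At least one machine is required');
    }
    // Assumption: normalize the query so that trivially different spellings
    // of the same query land on the same machine.
    $normalized = strtolower(trim($query));
    // Use only the first 7 hex digits (28 bits) of the MD5 sum, which fits
    // safely in an integer even on 32-bit PHP builds.
    $hashPrefix = hexdec(substr(md5($normalized), 0, 7));
    // Machines are numbered from 0 to $machineCount - 1.
    return $hashPrefix % $machineCount;
}

// Usage: with 4 search machines, the same query always maps to the same number.
$machine = pickSearchMachine('full text search', 4);
```

Note that changing the number of machines remaps most queries to different machines, so any per-machine caches would start cold after such a change.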
Sphinx
When I asked a friend which full-text search engines he recommends, he pointed out Sphinx, a standalone application for this purpose. It is not very popular software (it hasn't found its way into the Ubuntu repository, for example), but it seems very interesting.
Sphinx can be fed with XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, or can be communicated with via its API and libraries for many languages.
It seems to be somewhat similar to Lucene, but implemented in a traditional compiled language (C++) rather than Java.
The choice
There are probably other solutions worth trying, but I think the most appropriate for now is the Zend one, as it is the easiest to adapt. We can optionally add caching and query distribution. We also need more testing of situations that may arise (like a need to perform 100 queries simultaneously).
Doctrine PHP
tags: database dev doctrine orm php
30 Oct 2008 19:48
A few days ago, I found the Doctrine Project, which is an object-relational mapping (ORM) solution for PHP. It seems really powerful and actually very similar to the Wikidot DB layer (especially the one in Wikidot 2).
Doctrine is a really cool project and shares features with:
- Wikidot DB
- Zend Framework DB
- Hibernate (ORM for Java)
- Ruby on Rails DB
Let's start from the beginning of the list.
Doctrine is most similar to Wikidot DB, because both:
- are ORM implementations for PHP
- use an abstract (non-PHP, non-SQL) data definition language
  - for Wikidot it's an XML-based format
  - for Doctrine it's a YAML-based one
It's similar to Zend Framework DB in:
- using magic PHP methods for accessing object properties
Hibernate and Doctrine share:
- completeness of the DB layer
- higher abstraction than other DB layers
Ruby on Rails concepts in Doctrine:
- using YAML as a model definition language
- independence of the database engine
Example of model definition:
User:
  columns:
    id:
      type: integer
      primary: true
      autoincrement: true
    login:
      type: string(64)
      unique: true
      notnull: true
    realname:
      type: string
    password:
      type: string(64)
      notnull: true
    im:
      type: string(64)
    token_active:
      type: string
    token_created:
      type: timestamp
Bet:
  columns:
    id:
      type: integer
      primary: true
      autoincrement: true
    brand:
      type: string(100)
      notnull: true
    qty:
      type: integer
      default: 1
      notnull: true
    unit:
      type: string(100)
      notnull: true
BetUser:
  columns:
    bet_id:
      type: integer
      primary: true
    user_id:
      type: integer
      primary: true
    wins_if:
      type: string
      notnull: true
detect_relations: true
Doctrine comes with a CLI tool that offers the following commands:
Doctrine Command Line Interface
./doctrine.php create-tables
./doctrine.php rebuild-db
./doctrine.php generate-models-db
./doctrine.php generate-models-yaml
./doctrine.php generate-yaml-db
./doctrine.php load-data
./doctrine.php build-all-load
./doctrine.php generate-migrations-models
./doctrine.php build-all
./doctrine.php create-db
./doctrine.php migrate
./doctrine.php generate-yaml-models
./doctrine.php dump-data
./doctrine.php dql
./doctrine.php generate-migrations-db
./doctrine.php compile
./doctrine.php generate-sql
./doctrine.php drop-db
./doctrine.php generate-migration
./doctrine.php build-all-reload
The tool can be used to generate PHP files from YAML files, generate SQL files (for a given DB engine), drop and create the database, and more.
The dql command seems to be an interesting option. Doctrine supplies its own backend-independent SQL-like language (DQL) for querying and updating the DB. These operations can be run with the dql command. Although I haven't tested it yet, it seems perfect for cron jobs or other automated tasks that don't need to be coded in PHP.
<OFF-TOPIC>
Right now I'm converting my nuclear project, opiwo.com ("bet a beer" in Polish), from Zend Framework's DB layer to Doctrine ORM to see how it works in reality. The project also uses other nice technologies like JSON-RPC.
I want to minimize the work needed to launch it soon by using bleeding-edge technologies and web service programming concepts. The whole user interface is programmed in JavaScript, jQuery (with many plugins) and static HTML files. Only pure data is fetched from the server (with JSON-RPC). This leaves more power to the server and gives the user a more responsive interface. On the other hand, the website may be more CPU-intensive (I hope not too much).
</OFF-TOPIC>
Please forgive me, I'm quite excited about Doctrine right now. It's just how I react to really cool (well-designed) things. I hope it's really that cool :).
Django-like routing in PHP
tags: dev django php regexp route
05 Jul 2008 16:06
As I've recently worked with Django, the way it does URL-based routing seemed really cool to me. I missed that in PHP, so I decided to code something like it.
Here is a class that extends my Controller class, which does the routing:
class Controller_Ajax_Auth extends Controller_Ajax {
    protected $routes = Array(
        ':^info$:' => 'info',
        ':^challenge$:' => 'challenge',
        ':^login$:' => 'login',
        ':^logout$:' => 'logout',
    );

    protected function info($url) {
        $r = Array();
        /* something */
        $this->ajaxResponse($r);
    }

    protected function challenge($url) {
        /* $q = something */
        $this->ajaxResponse($q);
    }

    protected function login($url) {
        /* set $auth to true if logged */
        $this->ajaxResponse($auth);
    }

    protected function logout($url) {
        /* logout */
        $this->ajaxResponse(null);
    }
}
This simply routes the URLs info, challenge, login and logout to the corresponding methods of the same object.
But you can also route out of the object to an instance of another Controller subclass:
protected $routes = Array(
':^auth/(.*)$:' => 'Controller_Ajax_Auth',
);
This takes the URL and passes what's after auth/ to a new object of class Controller_Ajax_Auth (see the code above). Generally, the first pair of parentheses on the left side of each line defines what's passed to the method/object on the right side.
The controller has abstract errorHandler and defaultAction methods that need to be overridden. The former is called when an exception is thrown in a performed action. The latter is called when routing reaches some object and no routing line matches.
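The base Controller class itself is not shown in this post, so the following is only a guess at how such routing could be implemented. The dispatch method, its return values and the example classes RootController and AuthController are all hypothetical; only $routes, errorHandler and defaultAction come from the description above.

```php
<?php
// Hypothetical reconstruction of the routing idea; the real Controller class may differ.
abstract class Controller
{
    /** @var array regular expression => method name or Controller subclass name */
    protected $routes = array();

    /** Called when a performed action throws an exception. */
    abstract protected function errorHandler($exception);

    /** Called when no routing line matches the URL. */
    abstract protected function defaultAction($url);

    public function dispatch($url)
    {
        try {
            foreach ($this->routes as $pattern => $target) {
                if (!preg_match($pattern, $url, $matches)) {
                    continue;
                }
                // Pass on the first capture group if there is one, else the whole URL.
                $rest = isset($matches[1]) ? $matches[1] : $url;
                if (method_exists($this, $target)) {
                    return $this->$target($rest);
                }
                if (class_exists($target) && is_subclass_of($target, 'Controller')) {
                    $child = new $target();
                    return $child->dispatch($rest);
                }
            }
            return $this->defaultAction($url);
        } catch (Exception $e) {
            return $this->errorHandler($e);
        }
    }
}

// Example controllers (hypothetical) showing method routes and a nested route.
class AuthController extends Controller
{
    protected $routes = array(':^login$:' => 'login');
    protected function login($url) { return 'login called'; }
    protected function errorHandler($exception) { return 'auth error'; }
    protected function defaultAction($url) { return 'auth default: ' . $url; }
}

class RootController extends Controller
{
    protected $routes = array(
        ':^auth/(.*)$:' => 'AuthController',
        ':^boom$:' => 'boom',
    );
    protected function boom($url) { throw new Exception('failed'); }
    protected function errorHandler($exception) { return 'error: ' . $exception->getMessage(); }
    protected function defaultAction($url) { return 'root default: ' . $url; }
}

$root = new RootController();
$result = $root->dispatch('auth/login');
```

In this sketch a route target is first tried as a method of the current object and only then as a class name, so a method and a class with the same name would resolve to the method.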
Mirror Server
tags: blog dev lighttpd mirror php wikidot
24 Jun 2008 12:55
Today I've (almost) managed to create a mirror server for the wikidot.com service.
Features:
- CentOS distribution
- almost live Wikidot read only mirror
- database is replicated from the original service in real time to this server
- user uploaded files are replicated in real time to this server using FS mirror
- avatars are to be mirrored with rsync every now and then
- uses Portable IP address: 67.228.37.27
- lighttpd serves ALL content with FastCGI PHP
- database is read-only (being a replication slave)
- CVS configured to use SSH keys (no password asking)
- Wikidot PHP source mainly from the current production server
- Improvements (from CVS): uploaded files served like in OpenSource version
Problems:
- FS mirror is not 100% exact: it may fail to synchronize a small fraction of files every now and then, so we must additionally rsync them to make sure nothing is lost
- if you were logged in to Wikidot before, it'll complain about not being able to write to ozone_session (because the database is read-only)
- Flickr Gallery is not working and causes the whole page to display nothing
- magic file recognition not working (it may be a problem in PHP configuration or an extension):
PHP Warning: finfo_open(): Failed to load magic database at '/usr/share/misc/magic'. in /var/www/www.wikidot.com/wikidot/php/utils/- on line 3
PHP Warning: finfo_file(): supplied argument is not a valid file_info resource in /var/www/www.wikidot.com/wikidot/php/utils/- on line 4
PHP Warning: finfo_close(): supplied argument is not a valid file_info resource in /var/www/www.wikidot.com/wikidot/php/utils/- on line 5
UPDATE: Flickr problem solution:
- yum install php-pear-HTTP-Request
- chgrp lighttpd /var/lib/php/session
Remember:
- when switching to the mirror, we must restart memcached (or force invalidation of every item in it)
