10 Jan 2009 12:48
TAGS: dev high load lucene search wikidot
As you may know, I'm implementing a new search engine for Wikidot.
This seemed quite easy at first, since there is a nice Lucene implementation in PHP included in Zend Framework, and indeed during tests it was fast, simple and powerful. But those tests ran on about 100,000 documents (a document is a Wikidot page or forum thread), and we have about 2,500,000 documents in Wikidot now. And this is where the problems begin.
After indexing roughly 1,800,000 documents, the indexing process started running into memory problems (a 500 MB memory limit was not enough in SOME cases).
Even earlier I realized that the search times weren't good enough. This is why I implemented the searching part in Java, which is the native platform for the Lucene indexer. This sped things up.
Do you think indexing a document in under a second is fast? I thought this was a good result. Indexing a document takes about 0.2 s when the index holds only a small number of documents. But once there are 400,000 documents in the index, adding another one takes about 0.4 s. And even with this "good" indexing time (below a second), indexing the whole of Wikidot would take at least a few days.
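To make the "at least a few days" estimate concrete, here is a back-of-the-envelope calculation using only the figures quoted above. It is a rough sketch: it assumes a constant per-document cost, which is optimistic, since the measured time already doubled between a nearly empty index and 400,000 documents. The function name is made up for illustration.

```python
def total_indexing_days(documents: int, seconds_per_document: float) -> float:
    """Days needed to index `documents` one by one at a constant per-document cost."""
    seconds_per_day = 24 * 60 * 60
    return documents * seconds_per_document / seconds_per_day


DOCUMENTS = 2_500_000  # approximate number of Wikidot documents quoted above

best_case = total_indexing_days(DOCUMENTS, 0.2)   # cost measured on a nearly empty index
later_case = total_indexing_days(DOCUMENTS, 0.4)  # cost measured at 400,000 documents

print(f"{best_case:.1f} to {later_case:.1f} days")  # prints "5.8 to 11.6 days"
```

Since the per-document time keeps growing with the index, the real number would probably land above the lower figure.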
This leads me to the conclusion that Wikidot is really BIG.
A similar situation applied to user-uploaded files. We hit a filesystem limit of about 32,000 subdirectories in a single directory. Since all user-uploaded files were stored in a one-directory-per-wiki structure, this became a problem once we had more than 32,000 wikis.
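One common way around a per-directory limit like this is to spread the wiki directories over hashed bucket directories, so that no single directory has to hold all of them. The sketch below illustrates that general technique only; it is not necessarily what Wikidot does, and the function name, the `/srv/uploads` root and the two-level layout are assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_upload_dir(root: str, wiki_name: str) -> PurePosixPath:
    """Return a two-level bucketed path for a wiki's upload directory.

    Hypothetical layout: <root>/<2 hex chars>/<2 hex chars>/<wiki_name>.
    Each bucket level has at most 256 entries, and the wikis are spread
    over 256 * 256 leaf buckets instead of one flat directory.
    """
    digest = hashlib.sha256(wiki_name.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / digest[:2] / digest[2:4] / wiki_name


# Example with a made-up root and wiki name.
print(sharded_upload_dir("/srv/uploads", "my-wiki"))
```

The mapping is deterministic, so the path of any wiki can be recomputed from its name without a lookup table.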
Replicating this structure to another machine (also known as live backup of user-uploaded files) was also quite a challenge, because we reached the limit on directory watches in inotify, the kernel-level filesystem-monitoring system.
It all shows that things that seem easy are not necessarily easy at Wikidot's scale, which hits the limits of nearly every piece of software we use. But this is also a great chance to really test those projects and see how they react to such a high load.
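inotify watches are not recursive: monitoring a whole tree takes one watch per directory, which is why a big upload tree runs into the limit. A minimal sketch of how one might check this in advance is shown below. It assumes a Linux machine, where the per-user limit is exposed in `/proc/sys/fs/inotify/max_user_watches`; the helper names are made up for illustration.

```python
import os
from pathlib import Path
from typing import Optional

# Linux-specific location of the per-user inotify watch limit.
INOTIFY_LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def count_directories(root: str) -> int:
    """Count `root` plus every directory below it (one watch is needed for each)."""
    return sum(1 for _ in os.walk(root))


def read_watch_limit(limit_file: Path = INOTIFY_LIMIT_FILE) -> Optional[int]:
    """Return the configured watch limit, or None if it cannot be read."""
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        return None


def fits_in_watch_limit(directories: int, limit: Optional[int]) -> Optional[bool]:
    """True if the tree fits, False if it does not, None if the limit is unknown."""
    if limit is None:
        return None
    return directories <= limit


if __name__ == "__main__":
    needed = count_directories(".")
    limit = read_watch_limit()
    print(f"watches needed: {needed}, limit: {limit}, "
          f"fits: {fits_in_watch_limit(needed, limit)}")
```

The limit is shared by all watches the user holds, so in practice the tree needs to fit with some room to spare.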