<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wikidot="http://www.wikidot.com/rss-namespace">

	<channel>
		<title>Piotr Gabryjeluk dev blog</title>
		<link>http://piotr.gabryjeluk.pl</link>
		<description>Blog, photos and developer notes of Piotr Gabryjeluk, one of Wikidot.com developers.</description>
				<copyright></copyright>
		<lastBuildDate>Sat, 31 Jul 2010 22:35:28 +0000</lastBuildDate>
		
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:july-news</guid>
				<title>July News</title>
				<link>http://piotr.gabryjeluk.pl/dev:july-news</link>
				<description>

&lt;p&gt;Some of you may be used to me posting more often than I have lately.&lt;br /&gt;
Some of you may wonder why I stopped blogging.&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 28 Jul 2009 15:41:03 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>Some of you may be used to me posting more often than I have lately.<br /> Some of you may wonder why I stopped blogging.</p> <div class="content-separator" style="display: none:"></div> <h2><span>Brussels</span></h2> <p>Last month was full of adventures. It started on the 1st of July with me going to Brussels to meet our friend and talk about a Wikipedia-like site about art. We're going to help this man build the most complete site about art using Wikidot software!</p> <p>BTW, this was the first flight of my life. Quite a strange feeling, but generally fine.</p> <h2><span>Forms</span></h2> <p>I was working on some nice technical and UI improvements to Wikidot that are crucial for the art site (but really, really nice for Wikidot as well), like forms for entering, editing and viewing structured data on wiki pages.</p> <h2><span>Search issues</span></h2> <p>That week was also spent on some massive Wikidot.com search engine tweaks. A stupid one-line bug, <strong>not</strong> exporting the proper LC_ALL environment variable in the indexing script, caused many sites that used Asian or East European languages not to be indexed (most notably the great <a href="http://istorijska-biblioteka.wikidot.com/">ИСТОРИЈСКА БИБЛИОТЕКА</a>). At first we thought that we could re-index the broken sites, but our re-indexing mechanism was way too slow (it would take weeks for all the broken sites).</p> <p>Pieter then challenged me. He said he could index the whole of Wikidot in 6 hours. I thought it wasn't even possible, but then I started to work on it and managed to index the whole of Wikidot in less than 2 hours, without indexing tags at first. Then, with tags, it took 2 hours and 10 minutes or so. 
That was damn fast!</p> <p>Inspired by this, and by a disk-full error on the /var partition of our webserver (this is why we keep user-uploaded files and other important things on separate disks), I also rewrote the incremental indexer to work in a similar way to the whole-Wikidot re-indexer.</p> <h2><span><tt>search-api reindex</tt></span></h2> <p>If you care about some technical details:</p> <ul> <li>all search operations are issued through search-api, a separate program that can: <ul> <li>re-index the whole of Wikidot</li> <li>queue indexing of a page/thread</li> <li>queue deleting of a page/thread</li> <li>queue re-indexing of a site</li> <li>flush the queue</li> </ul> </li> <li>search-api is written in Python</li> <li>search-api uses PyLucene, the native Java Lucene library bound to CPython objects with JCC. Compiled with the GNU Java Compiler to native code (like C programs), this binding performs better than using Lucene with Sun's Java.</li> <li>before being rewritten in pure Python, search-api was written in Bash and was a wrapper around: <ul> <li><tt>java -jar searchApiHelper.jar search "phrase-to-search"</tt></li> <li><tt>php search-api-helper.php flush</tt></li> </ul> </li> <li>search-api also takes care of file locking to ensure that: <ul> <li>only one process at a time tries to modify the index</li> <li>items are added to the queue one after another</li> <li>during a big index modification (read: a full re-index) the queue is not flushed (so that after the re-index all queued changes are applied to the new index)</li> <li>when flushing the queue takes longer and cron tries to start more flushing processes, they simply exit (so only one process flushes the queue at a time)</li> </ul> </li> </ul> <h2><span>Union of Rock Festival</span></h2> <p>Just after the week spent in a nice hotel in Brussels, I went to Węgorzewo in Mazury (Poland's biggest lake district) to have fun at a rock music festival. 
Unfortunately, the music was not very impressive, so I mainly enjoyed the atmosphere at the campsite.</p> <p>The weather was not great. It was wet everywhere, the ground was covered in 20 centimeters of mud and it was hard to walk around without getting dirty. But during my first day there, I learned how to do that.</p> <h2><span>Improved workflow at Wikidot</span></h2> <p>Some of you have noticed that we recently started to work more efficiently, but this is not quite true. In fact we work as efficiently as before, but we are better organized and prioritize our tasks better. We also keep track of what we do, so we can later tell what we've done. So for us, this is a little more work "documenting" our work (so maybe we work even less efficiently than before?), but for the outside world, we make more noise (in a positive sense) about it. So basically, people know what we do, what we are going to do and when they can expect changes, and most importantly, they understand why a feature request is being postponed. The reason is (and always was) that we have more important things to do, but before, people couldn't tell.</p> <p><span class="printuser avatarhover"><a href="http://www.wikidot.com/user:info/squark" ><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/160/160264/a16.png" alt="Squark" style="background-image:url(http://www.wikidot.com/userkarma.php?u=160264)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/160/160264/a16.png" alt="Squark" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=160264,sizingMethod='scale')"/><![endif]--></a><a href="http://www.wikidot.com/user:info/squark" >Squark</a></span> has turned into a professional project manager who manages our time. 
<span class="printuser avatarhover"><a href="http://www.wikidot.com/user:info/pieterh" ><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/0/99/a16.png" alt="pieterh" style="background-image:url(http://www.wikidot.com/userkarma.php?u=99)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/0/99/a16.png" alt="pieterh" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=99,sizingMethod='scale')"/><![endif]--></a><a href="http://www.wikidot.com/user:info/pieterh" >pieterh</a></span> decided to <a href="http://blog.wikidot.com/">talk</a> to the Community and listen to their complaints (he reads or at least skims every post on the Community forums). He tells Łukasz what needs to be done, and Łukasz knows when we will have time to do it. This way communication inside Wikidot has improved. Also, we (<span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:1" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/0/1/a16.png" alt="michal frackowiak" style="background-image:url(http://www.wikidot.com/userkarma.php?u=1)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/0/1/a16.png" alt="michal frackowiak" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=1,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:1" target="_blank">michal frackowiak</a></span> and I) no longer follow the Community forums (which some of you may regret), but this allows us to concentrate on our work.</p> <h2><span>The work continues</span></h2> <p>As I mentioned before, we want to introduce a great feature to Wikidot, which is <em>forms</em>. 
But the implementation now concentrates on the <a href="http://www.wikidot.org/">open source version of Wikidot software</a> (once it's ready, working and tested, we'll copy the feature to the Wikidot.com service).</p> <h2><span><tt>aptitude install wikidot</tt></span></h2> <p>As forms are a huge change, I started to prepare the ground for them: I closed the most important bugs in Wikidot open source, and I'm about to start making Ubuntu packages for it to allow even simpler installation on Debian-based systems. Right now the installation involves only <a href="http://www.wikidot.org/installation-guide">6 child-easy steps</a> and in fact can be done by copying and pasting a few commands.</p> <h2><span>Yesterday's party</span></h2> <p>Yesterday I went to meet some old school friends in the heart of the city. It was meant to be a meeting for "a beer or two" but evolved into beer and dancing till morning. That was the first time I took a morning bus (and not even the first one) home straight after partying.</p> <p>It was great fun and I met great folks.</p> <h2><span>Summary</span></h2> <p>I hope that with this long blog post (but divided into friendly sections ;) ) I have made up for the long period of not posting anything here.</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
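<p>The locking rule from the <tt>search-api</tt> section of this post (a flushing process that finds the lock already taken simply ends) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the actual search-api code: the function name <tt>try_flush</tt>, the lock file path and the choice of <tt>fcntl.flock</tt> are hypothetical, and the approach only works on Unix-like systems.</p>

```python
import fcntl


def try_flush(lock_path, flush):
    """Run flush() only if nobody else holds the lock.

    Returns True if flush() ran, False if another holder was active.
    try_flush, lock_path and flush are illustrative names, not search-api's.
    """
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails immediately instead of
            # waiting when another process (or a full re-index) holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # somebody else is working on the index: just end
        flush()
        return True
    finally:
        # Closing the file releases the lock if we held it.
        lock_file.close()
```

<p>With something like this, cron can start a flush every minute without checking anything first: a run that overlaps a slow flush or a full re-index returns right away, and the queued changes are picked up by a later run.</p>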
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:lucene-replaces-local-searches</guid>
				<title>Lucene Replaces Local Searches</title>
				<link>http://piotr.gabryjeluk.pl/dev:lucene-replaces-local-searches</link>
				<description>

&lt;p&gt;As some of you know, some time ago I worked on a new &lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:wikidot-search-launched&quot;&gt;Wikidot search&lt;/a&gt;. It is used for the &lt;a href=&quot;http://www.wikidot.com/search:all&quot;&gt;Search all sites&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;This work was done because searching the whole Wikidot database was far from fast. At first we used Google Custom Search Engine to solve the problem, but we wanted to be independent of it. We also wanted to include search results that are accessible to the person searching but not to search engines (like private sites).&lt;/p&gt;
&lt;p&gt;This worked really well. The only problem was the long time it took to display the results: about 20 seconds, which was far better than the previous search, but much worse than Google. This was strange, because from previous tests we had calculated that the average search time should be about a second or two.&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 23 Jun 2009 18:25:58 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>As some of you know, some time ago I worked on a new <a href="http://piotr.gabryjeluk.pl/dev:wikidot-search-launched">Wikidot search</a>. It is used for the <a href="http://www.wikidot.com/search:all">Search all sites</a> page.</p> <p>This work was done because searching the whole Wikidot database was far from fast. At first we used Google Custom Search Engine to solve the problem, but we wanted to be independent of it. We also wanted to include search results that are accessible to the person searching but not to search engines (like private sites).</p> <p>This worked really well. The only problem was the long time it took to display the results: about 20 seconds, which was far better than the previous search, but much worse than Google. This was strange, because from previous tests we had calculated that the average search time should be about a second or two.</p> <div class="content-separator" style="display: none:"></div> <p>It became clear when we noticed that only 20-30 searches per day were being performed. The index is quite big, and it needs to be cached in RAM to perform well; with so few searches it rarely stayed cached.</p> <p>Today I did more tests under heavy load and it seems Lucene can handle a large number of queries. When users search often, the index is partially or even fully cached by the filesystem and searches are really quick!</p> <p>But our main problem to solve today was the slow local search, a.k.a. "search this site". Moreover, many concurrent queries were degrading database performance (not only for searching), so we decided to enable Lucene for local searches as well.</p> <p>I must say it works really nicely, it's fast, and it has a nice set of syntax tricks. For example, you can search for pages with something in their tags: just search for <em>youtube tags:embed</em>. This searches for pages matching <em>youtube</em> (in tags, title or content) and with <em>embed</em> in tags. 
If no such pages are found, partial matches are also returned, such as pages matching <em>youtube</em> (but with no <em>embed</em> in tags) or pages with <em>embed</em> in tags (but not matching <em>youtube</em>).</p> <p>To sum up, the new search is faster, gives more accurate results, takes load off the database (which was the main goal) and allows nicer syntax than the old one.</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
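<p>The behaviour of a query like <em>youtube tags:embed</em> described in this post (full matches first, partial matches only when there are none) can be illustrated with a toy in-memory search. This is only a sketch of the matching rule, assuming plain substring matching over a list of dictionaries; it is not how Lucene or Wikidot implement it, and the names <tt>parse_query</tt> and <tt>search</tt> and the page structure are hypothetical.</p>

```python
def parse_query(query):
    """Split a query like 'youtube tags:embed' into free terms and tag terms."""
    terms, tags = [], []
    for token in query.lower().split():
        if token.startswith("tags:"):
            tag = token[len("tags:"):]
            if tag:
                tags.append(tag)
        else:
            terms.append(token)
    return terms, tags


def search(pages, query):
    """Return titles of full matches, or of partial matches if there are none."""
    terms, tags = parse_query(query)

    def text_ok(page):
        # Free terms may match in the title, the content or the tags.
        haystack = " ".join([page["title"], page["content"], *page["tags"]]).lower()
        return all(term in haystack for term in terms)

    def tags_ok(page):
        # Tag terms must match a tag exactly.
        page_tags = [t.lower() for t in page["tags"]]
        return all(tag in page_tags for tag in tags)

    full = [p["title"] for p in pages if text_ok(p) and tags_ok(p)]
    if full:
        return full
    # Fallback: pages that satisfy only one part of the query.
    return [p["title"] for p in pages
            if (terms and text_ok(p)) or (tags and tags_ok(p))]
```

<p>A real engine would rank the results by score rather than return them in input order; the sketch only shows which pages qualify in each case.</p>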
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:wikidot-search-launched</guid>
				<title>Wikidot Search Launched</title>
				<link>http://piotr.gabryjeluk.pl/dev:wikidot-search-launched</link>
				<description>

&lt;p&gt;After about three months of indexing all the content hosted on Wikidot (because &lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:wikidot-is-big&quot;&gt;Wikidot is BIG&lt;/a&gt;), the time has come to launch the new search system.&lt;/p&gt;
&lt;p&gt;I described the system extensively in a blog post titled &lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock&quot;&gt;New search for Wikidot&#039;s gonna rock&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.wikidot.com/search:all&quot;&gt;The new search system&lt;/a&gt; replaced the old Google-powered one.&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Mon, 02 Mar 2009 21:28:30 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>After about three months of indexing all the content hosted on Wikidot (because <a href="http://piotr.gabryjeluk.pl/dev:wikidot-is-big">Wikidot is BIG</a>), the time has come to launch the new search system.</p> <p>I described the system extensively in a blog post titled <a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock">New search for Wikidot's gonna rock</a>.</p> <p><a href="http://www.wikidot.com/search:all">The new search system</a> replaced the old Google-powered one.</p> <div class="content-separator" style="display: none:"></div> <p>The main advantages over the Google search engine used until now are:</p> <ul> <li>it searches public sites plus those you are a member of</li> <li>semantic search: tags and titles carry more weight when searching</li> <li>simple to sophisticated queries <ul> <li>"blog site:community": search for anything with "blog" in it, but only on the site community.wikidot.com</li> <li>"tags:wikidot site:quake": search for pages tagged "wikidot" on my site (quake.wikidot.com)</li> </ul> </li> <li>poor-quality content is filtered out</li> </ul> <p>Things left to do:</p> <div class="image-container floatright"><img src="http://piotr.gabryjeluk.pl/local--thumbnail/small.jpg" alt="small.jpg" class="image" /></div> <ul> <li>add a site preview (thumbnail), like the one on the right, to the search results</li> <li>explain the <a href="http://piotr.gabryjeluk.pl/search-syntax">search syntax</a> in simple words</li> <li>promote good and/or active sites (give them a higher rank)</li> <li>decrease the delay between an edit and the search index update (it should be 5 minutes at most)</li> <li>create a custom search module that searches a bunch of (related) sites. This would be nice if you keep separate sites for different areas of a project, like a private site for project members and a public one for project users. 
The search is intelligent enough to filter restricted items out of the search results if someone is not a member of the private site the items come from.</li> </ul> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:wikidot-is-big</guid>
				<title>Wikidot is BIG</title>
				<link>http://piotr.gabryjeluk.pl/dev:wikidot-is-big</link>
				<description>

&lt;p&gt;As you may know, I&#039;m implementing a new &lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock&quot;&gt;search engine for Wikidot&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This seemed quite easy at first, given the nice Lucene implementation in PHP included in Zend Framework, and indeed during tests it was fast, simple and powerful. But it was tested on about 100,000 documents (a document is a Wikidot page or forum thread) and we have about 2,500,000 documents in Wikidot now. And this is where the problem begins.&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Sat, 10 Jan 2009 12:48:15 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>As you may know, I'm implementing a new <a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock">search engine for Wikidot</a>.</p> <p>This seemed quite easy at first, given the nice Lucene implementation in PHP included in Zend Framework, and indeed during tests it was fast, simple and powerful. But it was tested on about 100,000 documents (a document is a Wikidot page or forum thread) and we have about 2,500,000 documents in Wikidot now. And this is where the problem begins.</p> <div class="content-separator" style="display: none:"></div> <p>After indexing roughly 1,800,000 documents, there were some problems with the memory consumed by the indexing process (a 500&nbsp;MB memory limit was not enough in SOME cases).</p> <p>Even earlier I realized that the search times weren't good enough. This is why I implemented the searching part in Java, which is the native platform for the Lucene indexer. This sped things up.</p> <p>Do you think indexing a document in under a second is fast? I thought this was a good result. Indexing a document takes about 0.2&nbsp;s when there are only a few documents in the index. But when you have 400,000 documents in the index, adding another one takes about 0.4&nbsp;s. And even with this "good" indexing time (below a second), indexing the whole of Wikidot would take at least a few days.</p> <p>This leads me to the conclusion that Wikidot is really <span style="font-size:large;"><strong>BIG</strong></span>.</p> <p>A similar situation applied to the user-uploaded files. We hit a filesystem limit of about 32,000 subdirectories in a single directory. 
With all user-uploaded files kept in a one-directory-per-wiki structure, this became a problem once we had more than 32,000 wikis.</p> <p>Replicating this structure to another machine (also known as a live backup of user-uploaded files) was also quite a challenge, because we reached the limit of directory watches in the kernel-level filesystem-monitoring system (inotify).</p> <p>It all shows that things that seem easy are not necessarily easy at the scale of Wikidot, which hits some limit in nearly every piece of software we use. But this is also a great chance to really test those projects and see how they react to such a high load.</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
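<p>The "at least a few days" estimate in this post follows from simple arithmetic on the figures it quotes: about 2,500,000 documents at 0.2 to 0.4 seconds each. The helper below is only an illustration of that calculation; the function name <tt>full_index_days</tt> is hypothetical, and it assumes a single indexing process with a constant per-document time, which the post suggests is optimistic because the time grows with the index.</p>

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def full_index_days(documents, seconds_per_document):
    """Days needed to index `documents` one by one at a constant speed."""
    return documents * seconds_per_document / SECONDS_PER_DAY


# Figures quoted in the post: 2,500,000 documents, 0.2 s to 0.4 s per document.
best_case = full_index_days(2_500_000, 0.2)   # small-index speed throughout
worse_case = full_index_days(2_500_000, 0.4)  # speed measured at 400,000 documents
print(f"between {best_case:.1f} and {worse_case:.1f} days")
```

<p>So even at the speeds measured on a small index, a full pass is measured in days, not hours.</p>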
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:wikidot-search-ready-to-test</guid>
				<title>Wikidot Search Ready To Test</title>
				<link>http://piotr.gabryjeluk.pl/dev:wikidot-search-ready-to-test</link>
				<description>

&lt;p&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock&quot;&gt;Lucene-based search&lt;/a&gt;, a brand new feature, has been making its way into the Wikidot code for quite a long time.&lt;/p&gt;
&lt;p&gt;After all that time spent developing and dealing with performance problems (searching the whole of Wikidot in 3 seconds is too long!), it&#039;s time to test this thing!&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Thu, 18 Dec 2008 23:22:47 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p><a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock">Lucene-based search</a>, a brand new feature, has been making its way into the Wikidot code for quite a long time.</p> <p>After all that time spent developing and dealing with performance problems (searching the whole of Wikidot in 3 seconds is too long!), it's time to test this thing!</p> <div class="content-separator" style="display: none:"></div> <table style="margin:0; padding:0"> <tr> <td style="margin:0; padding:0"> <div id="toc"> <div class="title">Table of Contents</div> <div id="toc-list"> <div style="margin-left: 1em;"><a href="#toc0">Introduction</a></div> <div style="margin-left: 2em;"><a href="#toc1">Fresh install</a></div> <div style="margin-left: 2em;"><a href="#toc2">Existing install</a></div> <div style="margin-left: 1em;"><a href="#toc3">Features</a></div> <div style="margin-left: 1em;"><a href="#toc4">Test it!</a></div> <div style="margin-left: 1em;"><a href="#toc5">Extras</a></div> <div style="margin-left: 2em;"><a href="#toc6">Highlighting</a></div> <div style="margin-left: 2em;"><a href="#toc7">Need more performance or memory limit exhausted</a></div> <div style="margin-left: 1em;"><a href="#toc8">Summary</a></div> </div> </div> </td> </tr> </table> <h1><span>Introduction</span></h1> <p>This whole post is about the Search All Sites module for Wikidot Open Source. It is mainly for those of you who run your own Wikidot services. More information on getting the Wikidot software running can be found on <a href="http://my-wd-local.wikidot.com/guide:ubuntu-8-04-with-lighttpd-install">Ed's site</a>.</p> <h2><span>Fresh install</span></h2> <p>After installing the current version of Wikidot Open Source, you can test the new Search All Sites module right away. Just navigate to the <strong>/search:all</strong> page of your main wiki. 
Example: if your wiki farm runs on the domain <strong>mydomain.com</strong> and your main wiki is <strong>www.mydomain.com</strong>, navigate to <a href="http://www.mydomain.com/search:all">http://www.mydomain.com/search:all</a>.</p> <h2><span>Existing install</span></h2> <p>Updating an existing installation to the new search engine is a bit tricky, because you need to pre-populate the search index (the actual data that is searched when looking for the term the user entered).</p> <p>You can try the following commands:</p> <ul> <li>obtain root privileges</li> </ul> <div class="code"> <pre> <code>sudo su</code> </pre></div> <ul> <li>navigate to your Wikidot directory (it is /var/www/wikidot by default), update the code and run the lucene_bootstrap.php script as your lighttpd user</li> </ul> <div class="code"> <pre> <code>cd /var/www/wikidot
svn update
cd tests
sudo -u www-data php lucene_bootstrap.php</code> </pre></div> <p>This command adds every page to the index (normally located at /var/www/wikidot/tmp/lucene_index). Once indexed, a page can be searched (if the site containing the page is public or you're a member of the site). The command prints a dot for every 10 indexed items (an item is a page or a forum/comments thread).</p> <p>If this runs smoothly (i.e. with no errors; a Segmentation fault at the end is OK, but <strong>memory exhausted is not OK</strong>), all your sites are indexed and ready to be searched.</p> <p><strong>When it fails:</strong> you can increase the memory_limit setting in the corresponding php.ini file and re-run the command. There is no harm in running this command more than once, as indexing a page always deletes the page from the index before adding it again.</p> <p>Then just go to the <strong>/search:all</strong> location on your main wiki and search for some content.</p> <p>You also need to update your crontab file. 
Add:</p> <div class="code"> <pre> <code>* * * * * www-data /var/www/wikidot/bin/job.sh UpdateLuceneIndexJob</code> </pre></div> <br /> to your /etc/crontab (assuming you have Wikidot in /var/www/wikidot/). This adds an every-minute job that indexes the pages and threads queued for indexing when a page is saved or a site's public/private state changes. <h1><span>Features</span></h1> <ul> <li>First of all, the new search applies only to <strong>Search All Sites</strong>, i.e. <strong>Search This Site</strong> works the old way.</li> <li>Search uses titles and tags intelligently <ul> <li>pages with the exact search phrase in the title are placed higher in the result list</li> <li>pages with tags matching the search phrase are quite high in the result list</li> <li>pages with a title matching the search phrase are quite high in the result list</li> <li>pages with content matching the search phrase are somewhere low in the result list</li> <li>pages with parts of the search phrase matching titles and tags can be higher in the result list than pages whose content matches even the exact phrase</li> <li>this all means: <strong>tags and titles are more important than content for the search engine</strong></li> </ul> </li> <li>You can narrow your search to selected wikis only <ul> <li>append <strong>site:site1,site2,site3</strong> (no spaces between them) to your search query. Example: searching for "<strong>gabrys site:www,community</strong>" looks for <strong>gabrys</strong> in the titles, tags and contents of pages and threads on the sites <strong>www</strong>.yourdomain.com and <strong>community</strong>.yourdomain.com (supposing your Wikidot installation runs on yourdomain.com)</li> </ul> </li> <li>The search includes public sites plus sites you are a member of. Also, results from your sites are generally treated as more relevant by the search engine (i.e. 
they appear higher than the results from other sites)</li> <li>The search results for a given phrase and a given user are cached (if memcached is used) for a few minutes. This makes the search even smoother (no need to search the index again when the user only switches the result page from 2 to 3, for example)</li> </ul> <h1><span>Test it!</span></h1> <p>If you don't run and don't want to run your own Wikidot installation, you can try the new features on the following site:</p> <p style="text-align: center;"><a href="http://www.wikicomplete.info/search:all">http://www.wikicomplete.info/search:all</a></p> <h1><span>Extras</span></h1> <h2><span>Highlighting</span></h2> <p>Do you like the way Google highlights the searched words in its cached versions of result pages? Now you can add this feature to your Wikidot installation as well.</p> <p>Open the <strong>conf/wikidot.ini</strong> file and append these lines:</p> <div class="code"> <pre> <code>[search]
; enables highlighting of search phrase in the resulting documents
highlight = true</code> </pre></div> <p>This will highlight the words the user searched for using:</p> <ul> <li>Google Search</li> <li>Yahoo Search</li> <li>Your Wikidot installation search:all page</li> </ul> <h2><span>Need more performance or memory limit exhausted</span></h2> <p>We experienced rather low performance when searching through the 2 million pages and threads of Wikidot.com. The search results were generated in about 3 seconds. This was not fast enough for us, so we managed to speed things up using the native <a href="http://lucene.apache.com">Java Lucene implementation</a> for searching the index. This works because the <a href="http://framework.zend.com/manual/en/zend.search.lucene.html">PHP Lucene implementation</a> we use is compatible with the Java one. This means we can index pages with PHP and search with Java. And we do it! 
If you want to do this too (because you are experiencing low search performance or getting memory exhausted error messages), just add the following lines to your <strong>conf/wikidot.ini</strong> file:</p> <div class="code"> <pre> <code>[search]
; enables the use of Java for searching
use_java = true</code> </pre></div> <p>Notes</p> <ul> <li>if you already have a [search] section in the <strong>conf/wikidot.ini</strong> file, just add the <tt>use_java = true</tt> line to that section</li> <li>enabling Java for searching requires you to install a java executable for your system. You should know how to do this (try sudo aptitude install openjdk-6-jre).</li> <li>you don't need any Java libraries, as we have already bundled everything needed in the .jar file. The Java source and the Ant build script are normally located in the <tt>/var/www/wikidot/java</tt> directory (assuming you installed Wikidot in <tt>/var/www/wikidot</tt>).</li> </ul> <h1><span>Summary</span></h1> <p>Once we are sure the search is stable and gives relevant results, we'll introduce it to the Wikidot.com service. I calculated that indexing all the sites would take about 3 days! But searching is done in less than 1 second (using the Java program).</p> <p>I'm looking forward to your comments on the features, especially if you've tried them yourself!</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
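<p>A footnote on the <strong>site:</strong> filter described in the Features section: the snippet below is only a rough Python sketch of how such a query could be split into the search phrase and the list of sites. It is not the code Wikidot uses (that is written in PHP), and the function name <tt>parse_site_filter</tt> is made up for this illustration.</p>

```python
def parse_site_filter(query):
    """Split a search query into (phrase, sites).

    Hypothetical helper for illustration only, not part of Wikidot.
    A token such as "site:www,community" selects the sites to search;
    every other token is treated as part of the search phrase.
    """
    sites = []
    words = []
    for token in query.split():
        if token.startswith("site:"):
            # Site names are comma-separated with no spaces between them
            names = token[len("site:"):].split(",")
            sites.extend(name for name in names if name)
        else:
            words.append(token)
    return " ".join(words), sites


if __name__ == "__main__":
    phrase, sites = parse_site_filter("gabrys site:www,community")
    print(phrase, sites)
```

<p>An empty list of sites would then simply mean "search everything the user is allowed to see".</p>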
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock</guid>
				<title>New search for Wikidot&#039;s gonna rock!</title>
				<link>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-s-gonna-rock</link>
				<description>

&lt;p&gt;New search module is on its way to Wikidot (&lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-first-preview&quot;&gt;I described this before&lt;/a&gt;). Now it&#039;s actively tested at &lt;a href=&quot;http://www.wikicomplete.info/search:all/&quot;&gt;Wiki Complete&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I must say it&#039;s gonna absolutely &lt;strong&gt;rock&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 09 Dec 2008 21:53:27 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>New search module is on its way to Wikidot (<a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-first-preview">I described this before</a>). Now it's actively tested at <a href="http://www.wikicomplete.info/search:all/">Wiki Complete</a>.</p> <p>I must say it's gonna absolutely <strong>rock</strong>!</p> <div class="content-separator" style="display: none:"></div> <p>First of all it'll give much more relevant results than before:</p> <ul> <li>title and tags are much more important than pure page content for the indexer</li> <li>only a 1-3 minute delay between an edit and the page being searchable (compared to a few days with Google)</li> <li>searches for the given phrase in all public sites <strong>plus</strong> all the sites you are a member of (including private ones)!</li> <li>results from wikis you are a member of generally appear higher in the result list, because the indexer gives them a higher relevance factor than similar results from other sites.</li> </ul> <p>More features:</p> <ul> <li>really fast (typically a search is done in just 2 seconds, or even less if the result is cached)</li> <li>thumbnails of sites (feature to come)</li> <li>short site activity information (feature to come)</li> <li>supply the sites to search through in the query with the special keyword <tt>site:X,Y,Z</tt>, which searches for the given phrase in three sites: X, Y, Z</li> <li>supply the sites to search through as a parameter to the SearchSites module (feature and module yet to come)</li> </ul> <p><strong>Are you convinced yet?</strong></p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:is-wikidot-opensource-stable</guid>
				<title>Is Wikidot Open Source Stable?</title>
				<link>http://piotr.gabryjeluk.pl/dev:is-wikidot-opensource-stable</link>
				<description>

&lt;p&gt;Many of you ask about the development of Wikidot Open Source Edition. Is it stable? Can we safely use it?&lt;/p&gt;
&lt;p&gt;The thing that needs to be said is that the Wikidot source keeps changing. It is not meant to be &lt;strong&gt;solid&lt;/strong&gt; stable yet. The development of the .com and OS versions is quite separate, although code flows between them in two ways:&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 07 Dec 2008 10:03:25 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>Many of you ask about the development of Wikidot Open Source Edition. Is it stable? Can we safely use it?</p> <p>The thing that needs to be said is that the Wikidot source keeps changing. It is not meant to be <strong>solid</strong> stable yet. The development of the .com and OS versions is quite separate, although code flows between them in two ways:</p> <div class="content-separator" style="display: none:"></div> <ul> <li>fixes for bugs in .com find their way to OS</li> <li>many future improvements in .com are first tested and introduced in OS</li> </ul> <p>The latter introduces non-working revisions into the OS repository from time to time, but this is a product in the making (pre-1.0), so this can simply happen.</p> <h1><span>Stability</span></h1> <p>The current version (revision 317) of Wikidot Open Source is used on <a href="http://www.wikicomplete.info">Wiki Complete</a>.</p> <p>It's rather stable, but you have to know the following:</p> <h2><span>Lighttpd only</span></h2> <p>It's for <a href="http://lighttpd.net/">lighttpd</a> only (we want support for Apache back in 1.0).</p> <h2><span>INI file for configuration</span></h2> <p>The file <tt>conf/GlobalProperties.php</tt> is no longer needed and has to be deleted. But first have a look at <tt>conf/wikidot.ini</tt> and try to migrate any custom settings you had in <tt>conf/GlobalProperties.php</tt>.</p> <p>The full-blown verbose example of wikidot.ini is stored in <a href="http://svn.wikidot.org/svn/showfile.svn?path=%2fwikidot1%2ftrunk%2fconf%2ffull-example-of-wikidot.ini&amp;revision=HEAD&amp;name=wikidotorg">conf/full-example-of-wikidot.ini</a>. This file is not even meant to work; it just lists and describes every possible option. If an option has a default value, it is also mentioned. 
Use the file as the reference for your wikidot.ini file.</p> <h2><span>HTML user-uploaded files hosting disabled by default</span></h2> <p>Since it could be dangerous in certain cases, we have disabled serving of HTML files in the default Wikidot installation. However, Internet Explorer (6 and 7) ignores the hint to display HTML files as source, so it's not really a solution.</p> <p>The solution that really works is having a totally separate domain for hosting uploaded files only (at Wikidot it is wdfiles.com). The settings for this would be:</p> <div class="code"> <pre> <code>[security]
upload_separate_domain = true
upload_domain = your-different-domain.com
; having different domain for uploads we can safely enable user-uploaded HTML files serving
upload_restrict_html = false</code> </pre></div> <h2><span>Search All Wikis module</span></h2> <p>The SearchAllModule (which allows searching all wikis) is very experimental now. If you already have content that needs to be searchable, don't upgrade (or restore the previous php/modules/search/SearchAllModule.php file after the upgrade).</p> <p>The module is being <a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-first-preview">migrated</a> to Zend_Search_Lucene and requires an initial indexing of all sites.</p> <p>Also, it's very likely that the search mechanism will change soon, so I just don't recommend using the new version of the module.</p> <p>The <strong>Search This Wiki</strong> feature is unaffected!</p> <h2><span>Search Highlight</span></h2> <p>While developing the new search engine for Wikidot, we came up with the idea of highlighting the phrases a user searched for (using the internal Wikidot search or the Google/Yahoo search engines).</p> <p>You can enable it in the wikidot.ini file by appending:</p> <div class="code"> <pre> <code>[search]
highlight = true</code> </pre></div> <p>The search features (the new SearchAllModule and Search Highlighting) are already being tested on <a href="http://www.wikicomplete.info/">Wiki Complete</a>; however, there are some known limitations in the search module that still need work.</p> <h1><span>Upgrading</span></h1> <p>As a matter of fact, we don't supply any "upgrade" script. Unfortunately, this doesn't mean that upgrading is problem-free.</p> <p>Once we release the first stable version, we'll supply upgrade scripts for each incremental upgrade, i.e. 1.0 -&gt; 1.1 -&gt; 1.2 and so on.</p> <p>This is because it is much easier when we know EXACTLY which version we're moving from and which EXACT version we're moving to. Otherwise we would end up with scripts that may work but may also break something. If you're having trouble upgrading, ask on our <a href="http://groups.google.com/group/wikidot?hl=en">dev-list</a>.</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-first-preview</guid>
				<title>New Search For Wikidot First Preview</title>
				<link>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot-first-preview</link>
				<description>

&lt;p&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot&quot;&gt;New search for Wikidot&lt;/a&gt; is almost ready. It&#039;s using the &lt;strong&gt;Zend_Search_Lucene&lt;/strong&gt; solution (check the &lt;a href=&quot;http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot&quot;&gt;previous post&lt;/a&gt; for details).&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 05 Dec 2008 22:15:53 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p><a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot">New search for Wikidot</a> is almost ready. It's using the <strong>Zend_Search_Lucene</strong> solution (check the <a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot">previous post</a> for details).</p> <div class="content-separator" style="display: none:"></div> <p>Active testing is done on the <a href="http://wikicomplete.info/">Wiki Complete</a> wiki farm based on Wikidot software. It is the same farm where you can test the experimental <a href="http://piotr.gabryjeluk.pl/dev:search-highlight-on-wikicomplete-info">search phrase highlighter</a>.</p> <p>You can check the result at the <a href="http://www.wikicomplete.info/search:all/">search all wikis page</a> on Wiki Complete.</p> <p>I look forward to any bug reports.</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:search-highlight-on-wikicomplete-info</guid>
				<title>Search Highlight On Wikicomplete Info</title>
				<link>http://piotr.gabryjeluk.pl/dev:search-highlight-on-wikicomplete-info</link>
				<description>

&lt;p&gt;I just introduced this nice feature to the &lt;a href=&quot;http://wikicomplete.info/&quot;  &gt;Wiki Complete&lt;/a&gt; wiki farm (based on Wikidot software).&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 05 Dec 2008 01:55:52 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>I just introduced this nice feature to the <a href="http://wikicomplete.info/" >Wiki Complete</a> wiki farm (based on Wikidot software).</p> <h1><span>How does it work?</span></h1> <p>Here are the Google results for the <em>female superheroes wikicomplete</em> query:</p> <p><iframe src="http://www.google.com/search?q=female+superheroes+wikicomplete" align="" frameborder="" height="300px" scrolling="" width="100%" class="" style=""></iframe></p> <p>Locate any link from wikicomplete.info and click it. The words <em>female, superheroes, wikicomplete</em> should be highlighted using different colors.</p> <p>The same mechanism is used for local searches.</p> <p><span class="printuser avatarhover"><a href="http://www.wikidot.com/user:info/hartnell" ><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/10/10978/a16.png" alt="hartnell" style="background-image:url(http://www.wikidot.com/userkarma.php?u=10978)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/10/10978/a16.png" alt="hartnell" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=10978,sizingMethod='scale')"/><![endif]--></a><a href="http://www.wikidot.com/user:info/hartnell" >hartnell</a></span> (the main admin of Wiki Complete) really likes this feature. There is some chance we'll introduce it to Wikidot.com once the <a href="http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot">new search for Wikidot</a> is done!</p> <h1><span>Tech</span></h1> <p>If you wonder how I did this, here is the answer. I used the <a href="http://framework.zend.com/manual/en/zend.search.lucene.searching.html#zend.search.lucene.searching.highlighting">Zend_Search_Lucene</a> utility for highlighting. 
It was all quick and easy after parsing the HTTP_REFERER looking for the actual search query.</p> <h1><span>Summary</span></h1> <p>This feature was implemented on some random pages I walked through. I hope you like it being implemented on WikiComplete!</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
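<p>A footnote on the referrer parsing mentioned in the Tech section: the snippet below is only a rough Python sketch of the idea, not the PHP code running on Wiki Complete. The function name <tt>search_terms_from_referrer</tt> is made up, and the parameter names (<tt>q</tt> for Google, <tt>p</tt> for Yahoo) are assumptions about how those engines pass the query in the URL.</p>

```python
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer):
    """Return the list of words a visitor searched for, or [] if unknown.

    Hypothetical helper for illustration only. It assumes the search
    engine puts the query in the "q" (Google) or "p" (Yahoo) parameter
    of the referring URL.
    """
    params = parse_qs(urlparse(referrer).query)
    for key in ("q", "p"):
        if key in params:
            # parse_qs already decodes "+" and %XX escapes
            return params[key][0].split()
    return []


if __name__ == "__main__":
    url = "http://www.google.com/search?q=female+superheroes+wikicomplete"
    print(search_terms_from_referrer(url))
```

<p>The resulting words are what would then be handed to the highlighter.</p>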
				 	]]>
				</content:encoded>							</item>
					<item>
				<guid>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot</guid>
				<title>New Search For Wikidot</title>
				<link>http://piotr.gabryjeluk.pl/dev:new-search-for-wikidot</link>
				<description>

&lt;p&gt;As some of you already noticed, we now use Google Custom Search for searching the whole Wikidot. The reason for this was quite obvious.&lt;/p&gt;
&lt;p&gt;by &lt;span class=&quot;printuser avatarhover&quot;&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;&lt;!--[if gte IE 7]&gt;&lt;!--&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common--images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;background-image:url(http://www.wikidot.com/userkarma.php?u=2462)&quot; /&gt;&lt;!--&lt;![endif]--&gt;&lt;!--[if lt IE 7]&gt;&lt;img class=&quot;small&quot; src=&quot;http://www.wikidot.com/common&amp;#45;&amp;#45;images/avatars/2/2462/a16.png&quot; alt=&quot;Gabrys&quot; style=&quot;filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod=&#039;scale&#039;)&quot;/&gt;&lt;![endif]--&gt;&lt;/a&gt;&lt;a href=&quot;http://piotr.gabryjeluk.pl/profile2:2462&quot; target=&quot;_blank&quot;&gt;Gabrys&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
</description>
				<pubDate>Wed, 03 Dec 2008 20:58:09 +0000</pubDate>
												<content:encoded>
					<![CDATA[
						 <p>As some of you already noticed, we now use Google Custom Search for searching the whole Wikidot. The reason for this was quite obvious.</p> <p>We now host 2 million pages and the full text search engine we had been using was not fast enough to satisfy a regular user. The average search time (in all wikis) was about 30 seconds.</p> <p>Google searches the whole Wikidot in less than 1 second. The downsides of using the Google engine are:</p> <ul> <li>using an external service (a matter of prestige and dependence)</li> <li>displaying ads on search results (for those who don't use AdBlock)</li> <li>pages get indexed after some significant time</li> <li>only public wikis can be indexed</li> </ul> <p>One important thing is that Google indexes all content from a site. This includes the Wikidot.com footer, menus, wiki header, the real content and tags. All of this is treated in an unknown way, so we have no <strong>big/real</strong> influence on how Google treats different portions of pages.</p> <p>This leads to the conclusion that we need our own search engine.</p> <p>We would like:</p> <ul> <li>to treat tags as more important than the regular content</li> <li>not to search Wikidot.com static elements (like the footer on every page)</li> <li>to allow searching all wikis available to the given user <ul> <li>all public wikis</li> <li>all private wikis that the user is a member of</li> </ul> </li> </ul> <p>Coming to technical details, one could say we just need a generic <strong>full text search engine</strong>. We can use the one available in our storage system or one of the dedicated search-only engines.</p> <h1><span>Tsearch</span></h1> <p><strong>Tsearch</strong>, the full text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching a single wiki. This is quite nicely integrated and plays well. 
But when there are over 20,000 wikis to search (I mean only non-spam public ones), it is not efficient enough.</p> <h1><span>Lucene</span></h1> <p><strong>Lucene</strong>, one of the most popular search engines for Java, is one of the possible choices. It works as follows:</p> <ul> <li>the application adds some <em>documents</em> to the search <em>index</em> <ul> <li>a <em>document</em> is a webpage in our situation</li> <li>the <em>index</em> is a datastore to be used when searching</li> </ul> </li> <li>the user queries the <em>index</em> with a query <ul> <li>a bunch of <em>documents</em> is returned in order of relevance</li> <li>the <em>documents</em> returned are more or less the same as the documents added to the index before</li> </ul> </li> </ul> <p>This mechanism requires the application to populate the index</p> <ul> <li>updating the index every now and then</li> <li>updating a document on some change, like a page edit</li> </ul> <p>and requires us to define some functions that deal with search results</p> <ul> <li>the original webpage is usually not stored in the index, but only tokenized, to allow finding it when searching for any word in the document <ul> <li>this makes the index smaller and faster</li> <li>this means we need to store an additional ID of the webpage, to be able to retrieve the full result from the database based on the stored ID</li> </ul> </li> </ul> <h1><span>Nutch</span></h1> <p><strong>Nutch</strong> would be a different, more <em>Googlish</em>, approach to the search issue. Nutch indexes mainly HTML files (given URLs) and crawls through the links. 
This has both advantages and disadvantages:</p> <p>The main advantage of using Nutch is that as a search result we get a formatted HTML document</p> <ul> <li>with links to the items found</li> <li>with the context of the search phrase quoted</li> <li>with the search phrase words outlined in some way</li> </ul> <p>This is very similar to what we get when searching for some phrase with Google.</p> <p>What I don't like about Nutch is the quite big overhead of populating the index. A page must be compiled by the server and HTML must be produced. Then the same HTML must be parsed by the search engine to get the important data. A lot of information is generated by the server and then forgotten by the search engine.</p> <p>Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.</p> <p>There is the <strong>OpenSearch</strong> project, which aims to make the Nutch results more interchangeable (exporting them as RSS feeds). Using it, a PHP application can simply ask an HTTP service for results and get an RSS feed to parse and present to the user.</p> <h1><span>Zend_Search_Lucene</span></h1> <p>There is also a quite nice thing around: <strong>Zend_Search_Lucene</strong>. It is a search engine written entirely in PHP and part of the Zend Framework for PHP. The internal format of the search index file is compatible with Lucene, and this is where the name of the package comes from. Also the query language is the same (or very similar).</p> <p>One would expect the PHP implementation to be really slow when searching really big sets of data, but in our tests we got the search results for almost any query in about 1 second, searching almost the whole Wikidot.</p> <p>I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. Also the searching can be easily parallelized for many machines. 
For example, you can have 4 search machines, each getting 1/4 of the search queries to carry out. This way we don't reduce the time of a single search, but we avoid running many searches against one index at once.</p> <p>There are some options to consider when dividing the search queries among different machines. We can select the machine to perform the search at random, in turn or by search hash. The search hash approach would compute an MD5 sum (or other hash) of the query, take the remainder of dividing the hash (treated as an integer) by the number of machines, and assign the search to the machine with that number. This means the same query will always be performed on the same machine (so it can be better cached or optimized).</p> <p>The Zend implementation of Lucene is also really trivial to understand and use, so it seems a good start for me. Testing it on the whole public part of Wikidot I got an index of about 500&nbsp;MB. Adding a single page to an index of this size takes about 2 seconds. Searching takes about 1 second.</p> <h1><span>Sphinx</span></h1> <p>When I asked a friend which full text search engines he recommends, he pointed out <strong>Sphinx</strong>, a standalone application for this purpose. It is not very popular software (it hasn't found its way into the Ubuntu repository, for example), but it seems very interesting.</p> <p>Sphinx can be fed with XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, or can be accessed via its API and libraries for many languages.</p> <p>It seems somewhat similar to Lucene, but implemented using traditional languages, not Java.</p> <h1><span>The choice</span></h1> <p>There are probably some other solutions that are worth trying, but I think the most appropriate for now is the Zend one, as it's the easiest to adapt. We can optionally use some caching mechanisms and query distribution. 
Also we need more testing of situations that may appear (like a need to perform 100 queries simultaneously).</p> <p>by <span class="printuser avatarhover"><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank"><!--[if gte IE 7]><!--><img class="small" src="http://www.wikidot.com/common--images/avatars/2/2462/a16.png" alt="Gabrys" style="background-image:url(http://www.wikidot.com/userkarma.php?u=2462)" /><!--<![endif]--><!--[if lt IE 7]><img class="small" src="http://www.wikidot.com/common&#45;&#45;images/avatars/2/2462/a16.png" alt="Gabrys" style="filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src=http://www.wikidot.com/userkarma.php?u=2462,sizingMethod='scale')"/><![endif]--></a><a href="http://piotr.gabryjeluk.pl/profile2:2462" target="_blank">Gabrys</a></span></p> 
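<p>A footnote on the search hash idea from the Zend_Search_Lucene section: the snippet below is only a rough Python sketch of how a query could be mapped to one of several search machines. Nothing like this exists in Wikidot yet; the function name <tt>pick_machine</tt> and the numbering of machines from 0 are made up for this illustration.</p>

```python
import hashlib


def pick_machine(query, machine_count):
    """Map a search query to a machine number in 0..machine_count-1.

    Hypothetical helper for illustration only: take the MD5 sum of the
    query, treat it as one big integer and use the remainder of the
    division by the number of machines.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return int(digest, 16) % machine_count


if __name__ == "__main__":
    for q in ("gabrys", "female superheroes", "gabrys"):
        # The same query always lands on the same machine
        print(q, "->", pick_machine(q, 4))
```

<p>Note that with this simple scheme, changing the number of machines moves most queries to a different machine, so any per-machine cache would need to warm up again.</p>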
				 	]]>
				</content:encoded>							</item>
				</channel>
</rss>