Piotr Gabryjeluk blog

July News

28 July 2009

Some of you may be used to me posting more often than I have lately.
Some of you may wonder why I stopped blogging.

Brussels

Last month was full of adventures. It started on the 1st of July with me going to Brussels to meet our friend and talk about a Wikipedia-like site about art. We're going to help this man build the most complete site about art using Wikidot software!

BTW, this was my first flight ever. Quite a strange feeling, but generally fine.

Forms

I was working on some nice technical and UI improvements to Wikidot that are crucial for the art site (but really, really nice for Wikidot as well), such as forms for entering, editing and viewing structured data on wiki pages.

Search issues

That week was also spent on some massive Wikidot.com search engine tweaks. A stupid one-line bug (the indexing script was not exporting a proper LC_ALL environment variable) caused many sites in Asian or East European languages not to be indexed (most notably the great ИСТОРИЈСКА БИБЛИОТЕКА). At first we thought that we could re-index the broken sites, but our re-indexing mechanism was way too slow (it would have taken weeks for all the broken sites).
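
To illustrate the kind of fix involved, here is a minimal sketch in Python, assuming the indexer is started as a child process. The helper name `run_with_utf8_locale` and the locale `en_US.UTF-8` are assumptions made for this example, not the actual Wikidot script:

```python
import os
import subprocess
import sys


def run_with_utf8_locale(command, locale_name="en_US.UTF-8"):
    """Run a command with LC_ALL exported explicitly (the locale name is assumed)."""
    env = dict(os.environ)        # copy, so the parent environment stays untouched
    env["LC_ALL"] = locale_name   # a script started from cron or a bare shell may lack a UTF-8 locale
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# Stand-in for the real indexing script: a child process that prints what it sees.
result = run_with_utf8_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
child_locale = result.stdout.strip()
```

The point is only that the variable has to be set in the environment the indexer actually runs in, not just in an interactive shell.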

Pieter then challenged me. He said he could index the whole of Wikidot in 6 hours. I thought it wasn't even possible, but then I started to work on it and managed to index the whole of Wikidot in less than 2 hours, without indexing tags at first. Then, with tags, it took 2 hours and 10 minutes or so. That was damn fast!

Inspired by this, and by a disk-full error on the /var partition of our web server (this is why we keep user-uploaded files and other important things on separate disks), I also rewrote the incremental indexer to work in a similar way to the whole-Wikidot re-indexer.

search-api reindex

If you care about some technical details:

  • all search operations are issued through search-api, a separate program that can:
    • re-index whole Wikidot
    • queue indexing page/thread
    • queue deleting page/thread
    • queue re-indexing site
    • flush queue
  • search-api is written in Python
  • search-api uses PyLucene, the native Java Lucene library bound to CPython objects with PyJCC. Compiled to native code with the GNU Java Compiler (like C programs), this binding performs better than Lucene running on Sun's Java.
  • before being rewritten in pure Python, search-api was written in Bash and was a wrapper around:
    • java -jar searchApiHelper.jar search "phrase-to-search"
    • php search-api-helper.php flush
  • search-api also takes care of file locking to ensure that:
    • only one process tries to modify the index
    • items are added to the queue one after another
    • during a big index modification (read: full re-index) the queue is not flushed (so that after the re-index all changes are applied to the new index)
    • when flushing the queue takes longer than usual and cron tries to start more flushing processes, they simply exit (so only one process flushes the queue at a time)
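
The "only one flusher at a time" rule from the list above can be sketched with a non-blocking file lock. This is a minimal illustration of the technique, not the real search-api code: the lock file path and the function names `try_lock` and `flush_queue` are hypothetical, and the sketch assumes a Unix system where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location; the real search-api path is not known.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "search-api-flush.lock")


def try_lock(path=LOCK_PATH):
    """Return an open, locked file object, or None if someone else holds the lock."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def flush_queue():
    """Flush the queue unless another flusher is already running."""
    lock = try_lock()
    if lock is None:
        return "skipped"          # another process is flushing, so simply end
    try:
        # ... apply the queued index/delete operations to the index here ...
        return "flushed"
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```

A full re-index could hold the same lock for its whole run, which would keep the queue from being flushed until the new index is in place.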

Union of Rock Festival

Just after the week spent in Brussels in a nice hotel, I went to Węgorzewo in Mazury (Poland's biggest lake district) to have fun at a rock music festival. Unfortunately, the music was not very impressive, so I mainly enjoyed the atmosphere in the camping area.

The weather was not great. It was wet everywhere, the ground was covered in 20 centimeters of mud and it was hard to walk around without getting dirty. But during my first day there, I learned to do that.

Improved workflow at Wikidot

Some of you noticed that we recently started to work more efficiently, but this is not quite true. In fact we work as efficiently as before, but we are better organized and prioritize our tasks better. We also keep track of what we do, so we can later tell what we've done. So for us, this is a little more work "documenting" our work (so maybe we work even less efficiently than before?), but for the outside world, we make more noise (in a positive sense) around it. So basically, people know what we do, what we are going to do and when they can expect changes, and most importantly, they understand why some feature request is being postponed. This is (and was) because we have more important things to do, but before, they couldn't tell.

Squark turned into a professional project manager who manages our time. pieterh decided to talk to the Community and listen to their complaints (he reads or at least skims every post on the Community forums). He tells Łukasz what needs to be done, and Łukasz knows when we will have time to do it. This way communication inside Wikidot improved. Also, we (michal frackowiak and I) no longer look at the Community forums (which some of you may regret), but this allows us to concentrate on our work.

The work continues

As I mentioned before, we want to introduce a great feature to Wikidot: forms. For now the implementation concentrates on the open source version of the Wikidot software (once it's ready, working and tested, we'll copy the feature to the Wikidot.com service).

aptitude install wikidot

As forms are a huge change, I started to prepare good ground for them: I closed the most important bugs in Wikidot open source, and I'm about to start making Ubuntu packages for it to allow even simpler installation on Debian-based systems. Right now the installation involves only 6 child-easy steps and in fact can be done by copying and pasting a few commands.

Yesterday's party

Yesterday I went to meet some friends from my old school days in the heart of the city. It was meant to be a meeting for "a beer or two" but evolved into beer and dancing till morning. That was the first time I took a morning bus (not even the first one) home straight after partying.

It was great fun, and I met great folks.

Summary

I hope that with this long blog post (but divided into friendly sections ;) ) I have made up for the long period of not posting anything here.

Comments: 0, Rating: 1

Lucene Replaces Local Searches

23 June 2009

As some of you know, some time ago I worked on a new Wikidot search. It is used for the Search all sites page.

This work was done because searching the whole Wikidot database was far from fast. At first we used Google Custom Search Engine to solve the problem, but we wanted to be independent of it. We also wanted to include search results that are accessible to the person searching but not to search engines (like private sites).

This worked really well. The only problem was the long time needed to display the results. It was about 20 seconds, which was far better than with the previous search, but much worse than Google. This was strange, because from previous tests we had calculated that the average search time should be about a second or two.

It became clear when we noticed that only 20-30 searches per day were performed. The index is quite big, and it needs to be cached in RAM to perform well. With so few searches, it rarely stayed in the cache.

Today I did more tests under heavy load and it seems Lucene can handle a large number of queries. When users search often, the index is partially or even fully cached by the filesystem and searches are really quick!

But our main problem to solve today was the slow local search, a.k.a. "search this site". Moreover, many concurrent queries were degrading database performance (not only for searching), so we decided to enable Lucene for local searches as well.

I must say it works really nicely and fast, and it has a nice set of syntax tricks. For example, you can search for pages with something in tags. Just search for youtube tags:embed. This searches for pages matching youtube (in tags, title or content) and with embed in tags. If no such pages are found, partial matches are also returned: pages matching youtube (but with no embed in tags) or pages with embed in tags (but not matching youtube).
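
To make the matching rules above concrete, here is a toy model in plain Python. It is not Lucene's query parser or scoring, only a sketch of "full matches first, then partial matches". The sample pages and the function names `parse_query`, `count_matches` and `search` are made up for this illustration.

```python
def parse_query(query):
    """Split a query into (field, term) pairs; field is None for plain terms."""
    clauses = []
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field and term:
            clauses.append((field.lower(), term.lower()))
        else:
            clauses.append((None, token.lower()))
    return clauses


def count_matches(page, clauses):
    """Count how many clauses a page satisfies."""
    matched = 0
    for field, term in clauses:
        if field == "tags":
            hit = term in page["tags"]
        elif field is None:
            # A plain term may match in tags, title or content.
            text = " ".join([page["title"], page["content"]] + page["tags"]).lower()
            hit = term in text
        else:
            hit = False           # other fields are not modelled in this sketch
        matched += hit
    return matched


def search(pages, query):
    """Return titles of pages matching at least one clause, best matches first."""
    clauses = parse_query(query)
    scored = [(count_matches(page, clauses), page) for page in pages]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page["title"] for _, page in scored]


pages = [
    {"title": "Video howto", "content": "Paste a YouTube link", "tags": ["embed"]},
    {"title": "YouTube tips", "content": "Misc notes", "tags": ["video"]},
    {"title": "Iframes", "content": "Generic embedding", "tags": ["embed"]},
    {"title": "Unrelated", "content": "Nothing here", "tags": []},
]
results = search(pages, "youtube tags:embed")
```

Here "Video howto" satisfies both clauses and comes first, the two partial matches follow, and "Unrelated" is left out.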

To sum up, the new search is faster, gives more accurate results, spares the database (which was the main goal) and allows nicer syntax than the old one.

Comments: 7, Rating: 1

Wikidot Search Launched

2 March 2009

After about three months of indexing all the content hosted on Wikidot (because Wikidot is BIG), the time came to launch the new search system.

I described the system extensively in a blog post titled New search for Wikidot's gonna rock.

The new search system replaced the old Google-powered one.

The main advantages over the Google search engine that has been used until now are:

  • search in public sites + those you are a member of
  • semantic search: tags, title are more important when searching
  • simple to sophisticated queries
    • "blog site:community" — search for anything with "blog" in it but only on site community.wikidot.com
    • "tags:wikidot site:quake" — search for pages tagged "wikidot" on my site (quake.wikidot.com)
  • poor quality content filtered

Things left to do:

  • add a site preview (thumbnail) to the search results
  • explain in simple words the search syntax
  • promote good and/or active sites (give them higher rank)
  • decrease delay from editing to updating search index (should be max 5 minutes)
  • create a custom search module that searches a bunch of (related) sites. This would be nice if you keep separate sites for different areas of a project, like a private site for project members and a public one for project users. The search is intelligent enough to filter restricted items out of the results if someone is not a member of the private site the items come from.
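
The filtering of restricted items mentioned in the last point can be pictured as a simple post-filter on the results. This is only a sketch under assumed data shapes (the `site`, `page` and `private` keys and the function name `visible_results` are hypothetical), not the way the search actually stores permissions:

```python
def visible_results(results, user_sites):
    """Keep results from public sites or from private sites the user belongs to."""
    return [
        item for item in results
        if not item["private"] or item["site"] in user_sites
    ]


results = [
    {"site": "project-users", "page": "manual", "private": False},
    {"site": "project-team", "page": "roadmap", "private": True},
]

# A member of the private site sees both items, a guest sees only the public one.
member_view = visible_results(results, user_sites={"project-team"})
guest_view = visible_results(results, user_sites=set())
```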

Comments: 2, Rating: 0

Wikidot is BIG

10 January 2009

As you may know, I'm implementing a new search engine for Wikidot.

This seemed quite easy at first, given the nice Lucene implementation in PHP included in Zend Framework, and indeed during tests it was fast, simple and powerful. But it was tested on about 100,000 documents (a document is a Wikidot page or forum thread) and we have about 2,500,000 documents in Wikidot now. And this is where the problem begins.

After indexing roughly 1,800,000 documents, the indexing process ran into memory problems (the 500 MB memory limit was not enough in SOME cases).

Even earlier I realized that the search times weren't good enough. This is why I implemented the searching part in Java, which is the native platform for the Lucene indexer. This sped things up.

Do you think indexing a document in just a second is fast? I thought this was a good result. Indexing a document takes about 0.2 s when there are only a few documents in the index. But when you have 400,000 documents in the index, adding another one takes about 0.4 s. And even with this "good" indexing time (below a second), indexing the whole of Wikidot would take at least a few days.

This leads me to the conclusion that Wikidot is really BIG.
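
The "at least a few days" estimate follows from simple arithmetic on the figures above. As a rough check, assuming the per-document time stayed between 0.2 s and 0.4 s for the whole run (in practice it keeps growing with the index, so this is a lower bound):

```python
DOCUMENTS = 2_500_000          # pages and forum threads, figure from this post
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_index(seconds_per_document, documents=DOCUMENTS):
    """Days needed to index everything one document at a time."""
    return documents * seconds_per_document / SECONDS_PER_DAY


best_case = days_to_index(0.2)    # speed with a nearly empty index
slower_case = days_to_index(0.4)  # speed at around 400,000 documents
print(f"{best_case:.1f} to {slower_case:.1f} days")  # prints "5.8 to 11.6 days"
```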

A similar situation applied to user-uploaded files. We hit a filesystem limit of about 32,000 subdirectories in a single directory. Since all user-uploaded files are kept in a one-directory-per-wiki structure, this became a problem once we had more than 32,000 wikis.

Replicating this structure to another machine (also known as live backup of user-uploaded files) was also quite a challenge, because we reached the limit of directory watches in the kernel-level filesystem monitoring system (inotify).

It all shows that things that seem easy are not necessarily easy at the scale of Wikidot, which hits some limit in nearly every piece of software we use. But this is also a great chance to really test those projects and see how they react to such a high load.
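
One common way around a per-directory limit like this is to spread the wiki directories over a fixed set of intermediate "shard" directories. The sketch below shows that general technique only; it is not necessarily how Wikidot solved it, and the root path `/srv/uploads` and the function name `sharded_path` are assumptions for the example.

```python
import hashlib
import os


def sharded_path(root, wiki_name, levels=1):
    """Build root/<shard>/<wiki_name>, where each shard is two hex digits of a hash."""
    digest = hashlib.md5(wiki_name.encode("utf-8")).hexdigest()
    shards = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return os.path.join(root, *shards, wiki_name)


# The same wiki name always maps to the same shard directory.
path = sharded_path("/srv/uploads", "quake")
```

With one level there are 256 shard directories, so each of them would have to hold only about 1/256 of the wikis before reaching the same limit.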

Comments: 3, Rating: 0


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License