Lucene Replaces Local Searches

23 Jun 2009 18:25

As some of you know, some time ago I worked on a new Wikidot search. It is used for the Search all sites page.

We did this work because searching the whole Wikidot database was far from fast. At first we used Google Custom Search Engine to solve the problem, but we wanted to be independent of it. We also wanted to include search results that are accessible to the person searching but not to search engines (such as private sites).

This worked really well. The only problem was the long time it took to display the results. It was about 20 seconds, which was far better than with the previous search, but much worse than Google. This was strange, because from earlier tests we had calculated that the average search time should be about a second or two.

It became clear when we noticed that only 20-30 searches per day were being performed. The index is quite big, and it needs to be cached in RAM to perform well. With so few searches, it apparently did not stay in the cache.

Today I did more tests under heavy load and it seems Lucene can handle a large number of queries. When users search often, the index is partially or even fully cached by the filesystem and searches are really quick!

But our main problem to solve today was the slow local search, a.k.a. "search this site". Moreover, many concurrent queries were degrading database performance (not only for searching), so we decided to enable Lucene for local searches as well.

I must say it works really nicely and fast, and it comes with a nice set of syntax tricks. For example, you can search for pages with something in their tags: just search for youtube tags:embed. This searches for pages matching youtube (in tags, title or content) that also have embed in their tags. If no such pages are found, partial matches are also returned, such as pages matching youtube (but with no embed in tags) or pages with embed in tags (but not matching youtube).
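To make the matching rule concrete, here is a toy sketch in Python. It is not Lucene and does not use Lucene's scoring; the page dictionaries, the function names and the word-by-word matching are all assumptions made for illustration. It only mimics the behaviour described above: full matches first, partial matches when there are none.

```python
def parse_query(query):
    """Split a query into (field, term) clauses; bare terms get field None."""
    clauses = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        clauses.append((field, term) if sep else (None, token))
    return clauses


def words_in(page, field):
    """Return the lowercase words stored in one field of a page."""
    value = page.get(field, "")
    if isinstance(value, list):  # e.g. the list of tags
        value = " ".join(value)
    return value.lower().split()


def clause_matches(page, field, term):
    # A bare term may match in tags, title or content.
    fields = ["tags", "title", "content"] if field is None else [field]
    return any(term in words_in(page, f) for f in fields)


def search(pages, query):
    """Return titles of full matches, or of partial matches if there are none."""
    clauses = parse_query(query)
    hits = [(sum(clause_matches(p, f, t) for f, t in clauses), p["title"])
            for p in pages]
    full = [title for n, title in hits if n == len(clauses)]
    if full:
        return full
    partial = [(n, title) for n, title in hits if n > 0]
    # Pages matching more clauses come first (the sort is stable).
    return [title for n, title in sorted(partial, key=lambda h: -h[0])]
```

Calling search(pages, "youtube tags:embed") on a list of such page dictionaries returns the titles of pages that satisfy both clauses, and falls back to pages satisfying only one of them.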

To sum up, the new search is faster, gives more accurate results, spares the database (which was the main goal) and allows nicer syntax than the old one.

Comments: 7

PHP as FastCGI backend and Lighttpd

15 Jun 2009 20:59

Wikidot + Lighttpd + PHP5

At Wikidot we use PHP5 as a FastCGI backend to Lighttpd, a light and fast web server. It works like this:

  • there are a few hundred php5-cgi processes (the name says cgi, but they also support FastCGI mode) running and waiting to be used
  • a lighttpd process (only one needed!) manages the network connections to all the clients and, once a request is ready, serves a static file or forwards the request to one of the PHP backend processes.

We used to use the internal Lighttpd FastCGI process manager, meaning the lighttpd process itself used to start the PHP processes.

Problems

We encountered the known problem of 500 (server-side) errors appearing after some random time, especially under high traffic. The typical message appearing in Lighttpd's error.log was:

<some date>: (mod_fastcgi.c.2494) unexpected end-of-file (perhaps the fastcgi process died): pid: ...

There are plenty of reports on this in both Lighttpd's and PHP's forums, bug trackers and even some blogs.

Workarounds

We managed to write some hacky scripts that detected the situation and restarted the backends when needed. The reaction was so quick that almost no one noticed the errors, but damn, this is not how WE solve problems.
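For the curious, a workaround of this kind can be sketched roughly as below. This is not our actual script, just a minimal Python illustration: the function names and the threshold parameter are made up, and the restart action is passed in as a callback because how the backends get restarted depends on the setup.

```python
# Message fragment taken from the error.log line quoted above.
EOF_MARKER = "unexpected end-of-file (perhaps the fastcgi process died)"


def count_backend_failures(log_lines):
    """Count the log lines that report a died FastCGI backend."""
    return sum(1 for line in log_lines if EOF_MARKER in line)


def check_and_restart(log_lines, restart_backends, threshold=1):
    """Call restart_backends() if enough failures were logged.

    restart_backends is a hypothetical callback supplied by the caller.
    Returns True when a restart was triggered.
    """
    if count_backend_failures(log_lines) >= threshold:
        restart_backends()
        return True
    return False
```

A real script would feed it only the lines appended to error.log since the last check, so that one old error does not trigger restarts forever.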

A blind try

We decided to give spawn-fcgi a shot. What is it? It is a program that spawns FastCGI backends (independently of the Lighttpd server). Why try it? I had read somewhere that it works more reliably than the internal Lighttpd spawner. Interestingly, this program comes from the lighttpd package, so we stay in the family anyway. It is mainly intended for running the FastCGI backends as a different user than the web server user, or on different machine(s) than the web server machine. Naturally, this can be used for some smart load-balancing.

The only problem we encountered with this solution was an internal limit on the number of processes a single spawner process can spawn, which was 256 (hardcoded, fixed in later versions). But at the same time we decided to build a few FastCGI bridges (each spawning ~200 PHPs) anyway, so that was no longer a problem for us.
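The arithmetic behind splitting the backends into bridges is simple. Here is a small Python sketch of it; the function name and the example totals are hypothetical, only the 256 limit and the ~200 per bridge come from the text above.

```python
import math

SPAWN_LIMIT = 256  # hardcoded per-spawner limit mentioned above


def plan_bridges(total_backends, per_bridge=200):
    """Return how many PHP processes each bridge should spawn.

    per_bridge is the target size of one bridge and must stay
    within SPAWN_LIMIT. The backends are spread as evenly as possible.
    """
    if per_bridge > SPAWN_LIMIT:
        raise ValueError("a single spawner cannot start that many processes")
    if total_backends <= 0:
        return []
    bridges = math.ceil(total_backends / per_bridge)
    base, extra = divmod(total_backends, bridges)
    # The first `extra` bridges take one process more than the rest.
    return [base + (1 if i < extra else 0) for i in range(bridges)]
```

For example, plan_bridges(600) gives three bridges of 200 processes each, all safely below the limit.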

What was quite surprising (but honestly, I deeply believed in this) is that our problems with 500 server errors and PHP disappeared. This configuration has been working for about 2 weeks now with absolutely no hacky scripts involved and no restarting needed. Cool.

Why I wrote this

I wrote this short note just for the record and to let other people know that using spawn-fcgi instead of Lighttpd's internal FastCGI spawner might solve their problems with PHP (FastCGI) and 500 internal server errors.

Hope this helps someone.

Comments: 1

New Things, New Ideas

10 Jun 2009 18:09

OK, this is a very short note, because I'm trying to study for my Friday exam and would prefer to pass it.

Lately I thought it would be nice to have at least two things. One is a total revolution and probably won't be ready any time soon, but the second is quite easy to implement.

Events and PUSH

The first thing is to have a totally event-driven Linux and web services built on top of that. What does it mean? Every system event (file created, TCP connection set up, terminal window popped up, AC adapter plugged in, IM contact went on-line, …) should be broadcast to every application that registered for this event.

There should be two kinds of events: synchronous and asynchronous. Synchronous means that each event listener can stop the event (or delay it). For example, a messenger client can listen for the suspend event, react by disconnecting you from IM networks and then let the suspend continue. Asynchronous events are purely informational signals that cannot be interrupted, like "new wifi network detected".
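The two kinds of events can be sketched in a few lines of Python. This is only an in-process toy to show the semantics, not a proposal for the real system-wide mechanism; the EventBus class, its method names and the event names are all invented for this example.

```python
from collections import defaultdict


class EventBus:
    """Toy event bus with interruptible and informational events."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event, listener):
        self._listeners[event].append(listener)

    def emit_async(self, event, payload=None):
        """Informational event: every listener is told, none can stop it.

        In this toy the listeners are simply called in turn; a real
        system would queue the notifications instead.
        """
        for listener in self._listeners[event]:
            listener(payload)

    def emit_sync(self, event, payload=None):
        """Interruptible event: a listener returning False stops it.

        Listeners run in order, so each one also delays the event
        for as long as it takes to react.
        """
        for listener in self._listeners[event]:
            if listener(payload) is False:
                return False
        return True


bus = EventBus()
log = []
# The messenger disconnects and then lets the suspend continue.
bus.subscribe("suspend", lambda _: log.append("IM disconnected"))
bus.subscribe("wifi-detected", lambda ssid: log.append("new wifi: " + ssid))

suspend_allowed = bus.emit_sync("suspend")
bus.emit_async("wifi-detected", "cafe")
```

The design choice worth noting is that only synchronous events return anything to the emitter; an asynchronous emitter never learns what the listeners did.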

What to build on top of that? Imagine the following scheme:

  • a file is created in the /home/ directory and a signal is immediately sent
  • the signal is caught by an application that calculates disk usage. Disk usage changes to 98%. A signal is immediately broadcast
  • that signal is caught by a "low-space-notifier" application, which immediately sends you an email
  • the email is received and put into your INBOX
  • if your mail client is connected to the mail server, you get the notification about the new mail immediately
  • the mail client shows you a nice notification about disk usage reaching 98% on some other machine.

Going through all of this, including sending the email and the other steps, takes less than 5 seconds. So you get the message "just as it's sent".

Some of the concepts in the example are already implemented, like inotify for file changes or IMAP push mode, but the problem is that each case is solved individually, and what I would want is a UNIVERSAL and FAST solution for all event-driven computing.

On the other hand, what we have (now) is polling. So Linux polls the CD drive for new disks every 2 seconds, mail clients poll the server every minute, RSS aggregators ask servers about news every 15 minutes and so on.

The problem comes when you want to forward a message using polling. For example, I have a mail account on some server and I want some messages to be forwarded to another server. Polling the first mailbox every minute and processing the mails makes a mail appear in the second mailbox up to 1 minute later (in the worst case). If I poll for mails in the second mailbox as well, the message is delayed by up to 2 minutes. If I want to pass the message on to further mailboxes, the delivery time grows.

The solution for that is to poll services very, very often (like every 10 seconds). 10 seconds multiplied by 4-5 mailboxes (in a chain) is about 1 minute of delay, which is acceptable in most cases. But there is simply no need to ask the server for new messages every 10 seconds! How much better it would be to ask the server to notify us about messages just as they come!

All problems with polling come down to "what polling interval is OK". Polling too often means additional server work for answering "no, nothing new yet", and polling too rarely means outdated information. Either way polling is bad. And although it may seem inefficient to send messages as they come, it can be much more efficient than dealing with clients asking for new messages every 10 seconds.
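The trade-off is easy to put into numbers. Below is a tiny Python sketch of the arithmetic used above; the function names are made up, and the model is deliberately simplistic (every hop in the chain polls at the same interval and processing time is ignored).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_delay(poll_interval_s, hops):
    """Worst-case delivery delay when each hop in a chain polls the previous one."""
    return poll_interval_s * hops


def polls_per_day(poll_interval_s):
    """Number of requests one polling client sends to the server per day."""
    return SECONDS_PER_DAY // poll_interval_s
```

With these, worst_case_delay(60, 2) is 120 seconds (the two-mailbox case) and worst_case_delay(10, 5) is 50 seconds, while polls_per_day(10) shows the price: 8640 requests a day from a single client, most of them answered with "nothing new".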

This is the first dream: an event-driven world that delivers messages as fast as it can (not a new idea; it is used in good old SMTP, for example).

FsDb

The second dream is having a filesystem as powerful as a database. Why?

  • There are good tools to deal with filesystems: ls, mv, rm, cp, ln
  • You can serve filesystems with WWW, FTP, SAMBA, NFS servers
  • Files are supported nicely in all programming languages I know

The database features I would like to have in filesystems:

1. Automatic backlinks. In a database I have cars that belong to a user:

CREATE TABLE Car (
  name VARCHAR(255),
  year INT,
  user_id INT,
  -- assumes the User table has an id primary key
  FOREIGN KEY (user_id) REFERENCES User (id)
);

but I can also use that relation to User in reverse:

SELECT * FROM Car WHERE user_id = 7;

where 7 is my user_id. This works fast (even with millions of cars). If I have it in a filesystem:

$ ls cars/porsche -l
name
year
user -> ../users/gabrys

I can do something like:

$ cd cars
$ for car in *; do ls -l "$car" | grep -q 'user -> ../users/gabrys' && echo "$car"; done
porsche

but this is inefficient. It just goes through ALL cars and checks whether each one has a user link pointing to users/gabrys.

2. sorting/grouping by given field/property
3. paging (page 3, 10 items per page)
4. searching
5. full text searching: this is rarely done in databases, as it's always a better idea to have a separate application for that, like Lucene, so it is not needed in the filesystem.
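To show what points 2 and 3 mean in practice, here is a minimal Python sketch of sorting by a field and then paging. It works on plain in-memory dictionaries, which is an assumption made for illustration; it says nothing about how a filesystem would do this efficiently.

```python
def sort_and_page(records, field, page, per_page=10):
    """Sort records by one field and return the requested page.

    Pages are numbered from 1, so page=3 with per_page=10 returns
    the records at positions 21-30 of the sorted list.
    """
    ordered = sorted(records, key=lambda record: record[field])
    start = (page - 1) * per_page
    return ordered[start:start + per_page]
```

The naive version has to read and sort everything before it can return a single page, which is exactly the kind of work a database index avoids.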

How to do this

I believe all (or nearly all) of this can be done with a simple fs as the backend, doing additional integrity tasks after given operations. Most notably, when you link from a car to a user:

$ ln -s ../users/gabrys cars/porsche/user

Something like this should also be done automatically:

$ ln -s ../cars/porsche users/gabrys/cars/

Then listing Gabrys' cars would be:

$ ls users/gabrys/cars

And this would list links to the cars that have links to gabrys. So it's just an automatic backlink mechanism. It's just that easy, and that powerful.
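The backlink idea can be tried out in userspace with a short Python sketch. It is only an illustration under a few assumptions: the helper name link_with_backlink and its parameters are invented, it runs on a system that supports symlinks, and it computes every link target relative to the directory that holds the link (so the targets look a bit different from the simplified ones shown above).

```python
import os
import tempfile


def link_with_backlink(root, source, field, target, backlink_dir):
    """Create source/field -> target and target/backlink_dir/<name> -> source."""
    source_path = os.path.join(root, source)
    target_path = os.path.join(root, target)

    # Forward link, e.g. cars/porsche/user -> users/gabrys.
    forward = os.path.join(source_path, field)
    os.symlink(os.path.relpath(target_path, source_path), forward)

    # Automatic backlink, e.g. users/gabrys/cars/porsche -> cars/porsche.
    back_parent = os.path.join(target_path, backlink_dir)
    os.makedirs(back_parent, exist_ok=True)
    back = os.path.join(back_parent, os.path.basename(source))
    os.symlink(os.path.relpath(source_path, back_parent), back)


def demo():
    """Build a tiny tree in a temp directory and list gabrys' cars."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "cars", "porsche"))
        os.makedirs(os.path.join(root, "cars", "fiat"))
        os.makedirs(os.path.join(root, "users", "gabrys"))
        link_with_backlink(root, "cars/porsche", "user", "users/gabrys", "cars")

        cars = os.path.join(root, "users", "gabrys", "cars")
        # Check that the backlink really resolves to the car directory.
        resolves = os.path.samefile(os.path.join(cars, "porsche"),
                                    os.path.join(root, "cars", "porsche"))
        return sorted(os.listdir(cars)), resolves
```

Listing the user's cars is now a single directory read instead of a scan over all cars; the cost has moved to link creation, where two symlinks are written instead of one.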

Another thing that makes this nice is that you can easily back up the "database" (by having a special snapshot-enabled filesystem as a backend), NFS-mount it, replicate, rsync, encrypt, split it across different hosts and easily debug it with standard filesystem tools. Also, there would be no big magic in it, so everyone could work with it.

The problem still to think through is how to define the schema of the data stored in the "database". But maybe a schema is not even needed, given a few conventions for special directories (like a .backlinks directory storing backlinks, a .fields directory storing field values and that kind of stuff).

Wish me luck with exams!

Comments: 0

Wire Wrapped Jewelry

03 Jun 2009 09:18

I would like to present new high quality earrings made by my fiancée.

They are called Forget-Me-Not.

medium.jpg

They are the result of a great design and many hours of work. The great look comes from wire-wrapping, a technique hard enough that not many artists use it. What also counts is very high precision, which makes the earrings look gorgeous.

Carefully chosen color of the nacre makes them just perfect.

If you want to buy them, contact me at lp.kooltsal|rtoip#lp.kooltsal|rtoip (or leave a comment on this page). I can prepare them and send them to the EU or even the US. Payment by PayPal or international bank transfer. Including postage to the EU, the price would be €20 (20 Euro).

See the whole collection on this site (in Polish language).

UPDATE: I forgot to mention their dimensions. Their total length (height), including hooks, is 50 mm. We also decided to lower the price even further, to 15 Euro including shipping to the EU, just to make your decision easier!

This is a great opportunity. No one else has earrings like this, guaranteed!

UPDATE: earrings are sold. Look at similar ones at the following link biżuteria wire wrapping.

Comments: 2

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License