10 Jun 2009 18:09
TAGS: events fsdb push thoughts
OK, this is a very short note, because I'm trying to study for my Friday exam and would prefer to pass it.
Recently I thought it would be nice to have at least two things. One is a total revolution and probably won't be ready any time soon, but the second is quite easy to implement.
Events and PUSH
The first thing is to have a totally event-driven Linux and web services built on top of that. What does it mean? Every system event — a file created, a TCP connection set up, a terminal window popped out, the AC adapter plugged in, an IM contact coming on-line, … — should be broadcast to every application that registered for that event.
There should be two kinds of events — synchronous and asynchronous. Synchronous means that each event listener can stop the event (or delay it). For example, a messenger client can listen for the suspend event, react by disconnecting you from IM networks, and then let the suspend continue. Asynchronous events are informational-only signals that are not to be interrupted — like "new wifi network detected".
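A minimal sketch of the distinction, assuming a hypothetical user-space event bus (the `EventBus` class and the convention that a listener returning False cancels a synchronous event are my own illustration, not a real Linux API):

```python
# Hypothetical event bus -- an illustration, not a real Linux API.
class EventBus:
    def __init__(self):
        self.listeners = {}

    def register(self, event, callback):
        self.listeners.setdefault(event, []).append(callback)

    def emit_sync(self, event):
        # Synchronous: any listener returning False stops the event.
        for cb in self.listeners.get(event, []):
            if cb(event) is False:
                return False  # event cancelled (or delayed) by a listener
        return True

    def emit_async(self, event):
        # Asynchronous: informational only; return values are ignored.
        for cb in self.listeners.get(event, []):
            cb(event)

bus = EventBus()
# An IM client disconnects before suspend, then lets the suspend continue.
bus.register("suspend", lambda ev: print("IM: disconnecting") or True)
allowed = bus.emit_sync("suspend")  # True: no listener cancelled it
bus.register("wifi-detected", lambda ev: print("new wifi network"))
bus.emit_async("wifi-detected")
```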
What to build on top of that? Imagine the following scheme:
- a file is created in the /home/ directory; a signal is immediately sent
- the signal is caught by an application that calculates disk usage; disk usage changes to 98%, and a new signal is immediately broadcast
- that signal is caught by a "low-space-notifier" application, which immediately sends you an email
- email is received and put into your INBOX
- if you have your mail client connected to mail server, you get the notification about the new mail immediately
- the mail client shows you a nice notification that disk usage reached 98% on some other machine.
Processing everything, including sending the email, takes less than 5 seconds. So you get the message "just as it's sent".
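The chain above can be sketched in a few lines (the handler names, the 98% threshold, and the `send_mail` stand-in are my own illustration):

```python
# Sketch of the event chain: file created -> usage recalculated -> email.
sent = []

def send_mail(body):
    # stand-in for the real "send an email" step
    sent.append(body)

def on_usage_changed(usage):
    # the "low-space-notifier" application
    if usage >= 0.98:
        send_mail(f"disk usage reached {usage:.0%}")

def on_file_created(path, disk):
    # the disk-usage application reacting to the file-created signal
    disk["used"] += 1
    on_usage_changed(disk["used"] / disk["total"])

disk = {"used": 97, "total": 100}
on_file_created("/home/gabrys/new-file", disk)
print(sent)  # the notification fired as soon as the file appeared
```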
Some of the concepts in the example are already implemented, like inotify for file changes or IMAP push mode, but the problem is that each case is solved individually, and what I want is a UNIVERSAL and FAST solution for all event-driven computing.
On the other hand, what we have (now) is polling. So Linux polls the CD drive for new disks every 2 seconds, mail clients poll servers every minute, RSS aggregators ask servers for news every 15 minutes, and so on.
The problem comes when you want to forward a message using polling. For example, I have a mail account on some server and I want some messages to be forwarded to another server. Polling the first mailbox every minute and processing the mails makes a mail appear in the second mailbox 1 minute later (in the worst case). If I poll for mails in the second mailbox as well, the message is delayed by 2 minutes. If I want to pass the message on to further mailboxes, the distribution time keeps growing.
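The worst-case delay simply adds up across the chain, since each hop can miss a message by at most its full poll interval:

```python
# Worst-case delay of a polling-based forwarding chain:
# every hop contributes up to its full poll interval.
def worst_case_delay(poll_intervals_seconds):
    return sum(poll_intervals_seconds)

print(worst_case_delay([60, 60]))  # two mailboxes polled every minute -> 120 s
print(worst_case_delay([10] * 5))  # five hops at 10 seconds each -> 50 s
```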
The solution for that is to poll services very, very often (like every 10 seconds). 10 seconds multiplied by 4-5 mailboxes (in a chain) gives about a 1 minute delay, which is acceptable in most cases. But asking the server for new messages every 10 seconds is just not needed! How much better it would be to ask the server to notify us about messages just as they arrive!
All problems with polling come down to "what polling interval is OK". Polling too often means additional server work answering "no, nothing new yet", and polling too rarely means outdated information. Either way polling is bad. And although it may seem inefficient to send messages as they come, it can be much more efficient than dealing with clients asking for new messages every 10 seconds.
This is the first dream: an event-driven world that delivers messages as fast as it can (not a new idea — it is used in good old SMTP, for example).
The second dream is having a filesystem as powerful as a database. Why?
- There are good tools to deal with filesystems: ls, mv, rm, cp, ln
- You can serve filesystems with WWW, FTP, SAMBA, NFS servers
- Files are supported nicely in all programming languages I know
What database features I would like to have in filesystems:
1. Automatic backlinks. In a database I have cars that belong to a user:
CREATE TABLE Car (
    name VARCHAR(100),
    year INT,
    user_id INT REFERENCES User(id)
);
but I can use that relation to User in reverse:
SELECT * FROM Car WHERE user_id = 7;
where 7 is my user_id. This works fast (even with millions of cars). If I have it in a filesystem:
$ ls -l cars/porsche
name
year
user -> ../users/gabrys
I can do something like:
$ cd cars
$ for car in *; do ls -l $car | grep 'user -> ../users/gabrys' > /dev/null && echo $car; done
porsche
but this is very inefficient. It just goes through ALL the cars and checks whether they have a user link pointing to users/gabrys.
2. sorting/grouping by given field/property
3. paging (page 3, 10 items per page)
4. full-text searching — this is rarely used in databases, as it's almost always a better idea to have a separate application for that, like Lucene, so this is not needed in the filesystem either.
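Sorting and paging are exactly what a plain filesystem forces you to do by hand today; a sketch of how a client has to do it (the `page` helper and the throwaway directory are my own illustration):

```python
import os
import tempfile

# Paging a directory listing client-side: sort everything, then slice.
# A "filesystem as database" would do this server-side instead.
def page(dirname, page_no, per_page=10, key=None):
    entries = sorted(os.listdir(dirname), key=key)
    start = page_no * per_page
    return entries[start:start + per_page]

# demo on a throwaway directory
d = tempfile.mkdtemp()
for name in ["audi", "bmw", "fiat", "porsche", "saab"]:
    open(os.path.join(d, name), "w").close()
print(page(d, 0, per_page=2))  # ['audi', 'bmw']
print(page(d, 1, per_page=2))  # ['fiat', 'porsche']
```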
How to do this
I believe all (or nearly all) of this can be done with a simple fs as the backend, doing additional integrity tasks after given operations. Most notably, when you link from a car to a user:
$ ln -s ../users/gabrys cars/porsche/owner
something like this should also be done automatically:
$ ln -s ../cars/porsche users/gabrys/cars/
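The whole helper could look like this (a minimal Python sketch; the `ln_with_backlink` name and the layout under a temporary root are my own illustration):

```python
import os
import tempfile

# Sketch of the automatic backlink mechanism: creating
# cars/porsche/owner -> users/gabrys also creates users/gabrys/cars/porsche.
def ln_with_backlink(root, car, user):
    car_dir = os.path.join(root, "cars", car)
    user_dir = os.path.join(root, "users", user)
    backlink_dir = os.path.join(user_dir, "cars")
    os.makedirs(car_dir, exist_ok=True)
    os.makedirs(backlink_dir, exist_ok=True)
    os.symlink(user_dir, os.path.join(car_dir, "owner"))  # the explicit link
    os.symlink(car_dir, os.path.join(backlink_dir, car))  # the automatic backlink

root = tempfile.mkdtemp()
ln_with_backlink(root, "porsche", "gabrys")
print(os.listdir(os.path.join(root, "users", "gabrys", "cars")))  # ['porsche']
```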
Then listing Gabrys' cars would be:
$ ls users/gabrys/cars
And this would list links to the cars that link to gabrys. So it's just an automatic backlink mechanism. It's just that easy, and that powerful.
Other things that make this nice: you can easily back up the "database" (by having a snapshot-enabled filesystem as the backend), NFS-mount it, replicate it, rsync it, encrypt it, split it across different hosts, and easily debug it with standard filesystem tools. Also, there would be no big magic in it, so everyone could work with it.
The problem still to think about is how to define the schema of the data stored in the "database". But maybe a schema isn't even needed, given a few conventions of special directories (like a .backlinks directory storing backlinks, a .fields directory storing field values, and that kind of stuff).
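The .fields convention could be as simple as one file per field (the convention itself and the `set_field`/`get_field` helpers are only my guess at what such a schema-less layout could look like):

```python
import os
import tempfile

# Guessed ".fields" convention: each record is a directory, each field
# is a small file inside its ".fields" subdirectory.
def set_field(record_dir, name, value):
    fields = os.path.join(record_dir, ".fields")
    os.makedirs(fields, exist_ok=True)
    with open(os.path.join(fields, name), "w") as f:
        f.write(str(value))

def get_field(record_dir, name):
    with open(os.path.join(record_dir, ".fields", name)) as f:
        return f.read()

car = os.path.join(tempfile.mkdtemp(), "porsche")
set_field(car, "year", 2009)
print(get_field(car, "year"))  # 2009
```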
Wish me luck with exams!