04 Mar 2009 22:22
TAGS: failure plans wikidot
Today Wikidot suffered a short outage. After more than an hour, we managed to get everything back to normal. The part that took the longest was (who would have guessed?) the filesystem check, which ran after 280 days without one.
Normally, starting up a machine takes a minute or two and is almost indistinguishable from a network outage or some other temporary failure. But with as many user-uploaded files as Wikidot stores, checking the filesystem takes a long time.
I must confess that, and not only because of today's crash, we have plans to decentralize the service and move it to a more distributed environment to make it (even) more reliable. Even counting the crash, our uptime is still very high, high enough to satisfy just about everyone. But not us. We aim for 100% (or more ;-) ) uptime and want to make things totally fault-tolerant.
I must say we are really, really sorry for what happened today, but at the same time I must assure you that we really care about you, the Users, as many of you have surely noticed. I hope you still believe in us :).