27 Aug 2009 13:03
TAGS: crash dev wikidot
Last night we made Wikidot online again after a great crash.
Wikidot was down for about 12 hours and the time it was down was full of work for us. We got a few things that could be broken starting from recent changes of the Wikidot software, hardware failure, high load-related kernel bugs or limitations to connection number or maximum possible number or file descriptors.
The problem is Wikidot kind of worked, so some people had their sites loading, some other not and getting "500" errors. We didn't want to stop it, but at some point Wikidot was completely unusable. We switched the database to the other machine, but this was not the solution, then we switched all Wikidot traffic to the machine, still no good, we switched the software to some previous version, but this still seemed bad.
Finally we worked out, there was a site, that had so big traffic, that it killed anything else (and itself as well). When we temporarily disabled it, the whole Wikidot started to work nicely again. Then Michał made some improvements for the high traffic site serving and the situation is stable again.
In the middle of everything, we had huge problems with our hosting company and their service called Portable IP addresses. It seems that switching DNS is much more reliable that using Portable IPs that took hours to switch (and were supposed to take seconds to switch)! DNS switching time was 15 minutes.
We learned a lot from the situation. Hardware upgrade postponed from really long time needs to be done quite quickly. We need more servers, to see which element breaks. For example if database server has high load, we know we need to tune database settings. If we have all on one massive server and one brick on it crashes, it usually causes all the server overloaded and this causes other bricks to crash as well, so it's hard too tell what the real problem is.
Another thing is that we see our users want information on what happens. It's bad when Wikidot crashes, but it's even worse, when it crashes and they have no information about this.
So, the next time a similar disaster happens we'll update on each technical detail possible, to let you know, that we know it's broken and we work hard to fix it. Other thing is, we plan having more fail-over servers in case something dies.
Thank you all for using Wikidot, it's a great pleasure working (and fixing things) for you!