Woohoo! We’ve reached 99%+ uptime for our core systems since we began keeping track (about 4 months at the time of this post). That isn’t just “not bad”, it’s great, considering we were doing live migrations, emergency data recovery, and the cut-over to our new cluster all during this time.
Apart from some extra downtime shown for our gateway (it was blocking pings, so it falsely showed as down), we have done pretty well. EngSocSrv was the worst hit for true uptime: it had a failed disk earlier this year (knocking the club sites offline for a while) and was also subject to downtime from our damaged ECF uplink. Both of these types of failure have been done away with now that all systems are RAIDed, mirrored, and backed up daily, and are also supported by a dual-wire redundant uplink.
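To put the 99% figure in perspective, here is a rough back-of-the-envelope sketch of how much downtime that target leaves room for. It assumes "about 4 months" means 120 days, which is our own rounding for illustration; the function names are made up for this example and are not part of any monitoring tool we run.

```python
def allowed_downtime_hours(days: float, uptime_target: float) -> float:
    """Hours of downtime that still meet the uptime target over the window."""
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    return days * 24 * (1.0 - uptime_target)


def uptime_fraction(days: float, downtime_hours: float) -> float:
    """Fraction of the window a system was up, given its total downtime."""
    if days <= 0:
        raise ValueError("days must be positive")
    return 1.0 - downtime_hours / (days * 24)


# Assumption: "about 4 months" is treated as 120 days.
WINDOW_DAYS = 120

budget = allowed_downtime_hours(WINDOW_DAYS, 0.99)
print(f"99% uptime over {WINDOW_DAYS} days allows about {budget:.1f} hours of downtime")
```

Under that assumption the budget works out to a little over a day in total, which is not much headroom for a failed disk and a damaged uplink in the same window.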
Our first real performance and reliability test came last week during the F!rosh events. Despite being hammered by high volumes of traffic from people checking out the F!rosh week schedule, and by people deliberately trying to overload the servers during HavengerScunt, everything stayed up.
Let’s cross our fingers that the good fortune carries on through midterms and finals, too!