At 5:30 AM EST on Thursday, Feb 3, the monitoring system on one of our
internal file servers detected a performance slowdown and attempted to
resolve it automatically. This made the problem much worse, and the
server was unresponsive for an hour. The failure cascaded into errors
in our security database, which made the entire website unresponsive.
When the server came back up at 6:30 AM EST, we had to restore one of
our databases slowly to avoid any data corruption or data loss. The
recovery was successful but did not finish until about 10:30 AM EST.
Once it was complete, the site came back online.

Steps we have taken or are taking to prevent this from happening again:

* Our internal file server will soon be retired, as all of our
uploaded files will be hosted on Amazon S3.
* Our security system is being greatly simplified during our migration
to a new hosting service, which should be complete within the month.
After the migration, a crash of a single back-end server should not
prevent the other servers from responding to web requests.
* The monitoring system that aggravated the original performance
issue and shut down the server will be replaced with a simpler
system at our new hosting service.
* We have synchronized replicas of our databases and will prepare
them for use in the event of a database failure, so that our downtime
can be measured in minutes rather than hours.
* We will create an externally hosted status page, which we will keep
up to date during any downtime event, planned or otherwise.