Groupsite Server Downtime: February 3, 2011

At 5:30 AM EST on Thursday, February 3, one of our internal file
servers detected a performance slowdown and attempted to resolve it
automatically. That attempt made the problem much worse, and the
server was completely unresponsive for an hour. The failure cascaded
into errors in our security database, which made the entire website
unresponsive. When the server came back up at 6:30 AM EST, we had to
restore one of our databases slowly to avoid any data corruption or
data loss. The data recovery was successful but slow; it did not
finish until about 10:30 AM EST. Once it was complete, the site came
back online.


 



Steps we have taken or are taking to prevent this from happening again:


 


* Our internal file server will soon be obsolete, as all of our
uploaded files will be hosted in the cloud on Amazon S3.


* Our security system is being greatly simplified during our migration
to a new hosting service, which should be complete within the month.
No single back-end server crash should be able to prevent the other
servers from responding to web requests.


* The monitoring system that aggravated the original performance
issue and shut down the server will be replaced with a simpler system
at our new hosting service.


* We have synchronized replicas of our databases and will prepare
them for use in case of a database failure, so that our downtime can
be measured in minutes rather than hours.


* We will create an externally hosted status page, which we will keep
up to date during any downtime event, planned or otherwise.


 




We appreciate your patience and welcome any feedback!







Sincerely,





The Groupsite.com Team
