Groupsite's Server Downtime, February 3, 2011

by Celeste Wooten

February 4, 2011

At 5:30 AM EST on Thursday, February 3, one of our internal file
servers detected a performance slowdown and attempted to resolve it
automatically. Unfortunately, this made the problem much worse, and the
server was completely unresponsive for an hour. That failure cascaded
into errors in our security database, which left the entire website
unresponsive. When the server came back up at 6:30 AM EST, we had to
restore one of our databases slowly to avoid any data corruption or
data loss. The data recovery was successful but slow; it did not finish
until about 10:30 AM EST. Once that was complete, the site came back
online.

Steps we have taken or are taking to prevent this from happening again:

* Our internal file server will be obsolete soon, as all of our
  uploaded files will be cloud hosted by Amazon S3.

* Our security system is being greatly simplified during our migration
  to a new hosting service, which should be complete within the month.
  After the migration, a crash of a single back-end server should not
  be able to prevent the other servers from responding to web requests.

* The monitoring system that aggravated the original performance issue
  and shut down the server will be replaced with a simpler system at
  our new hosting service.

* We have synchronized replicas for our databases, and will prepare
  them for use in the case of a database failure so our downtime can be
  measured in minutes rather than hours.

* We will create an externally hosted status page, which we will keep
  up to date during any downtime event, planned or otherwise.

We appreciate your patience and welcome any feedback!

Sincerely,

The Groupsite.com Team
