August, 2012

Prime Hosting RAID meltdown: servers restored from months-old backups

Hundreds of customers of UK-based Prime Hosting suffered website outages following a disk meltdown, reports The Register. The downtime was due to the failure of several hard disk drives, which degraded the storage array to the point where it ceased to function. The outage at the Manchester-based hosting reseller began at 5am on 31 July, and three days later some sites were still down.

James Smith, executive director at parent company M247, explained that the fault lay with the hardware: "The problems were caused by the failure of three hard drives in a RAID6 array, therefore losing parity."

"After one drive failed, the rebuild process staggered into problems," continued James Smith. "Two other drives started exhibiting a high rate of media error during a rebuild process to replace a failed drive. The second drive failed at 15 percent rebuilding. At this point, the RAID array was severely degraded and could not tolerate any further failures. Unfortunately the third drive with high media errors then also went in to a failed state during efforts to take a more recent replicated copy of data. The replication of data had 790GB of data to sync, it managed 150GB before failing."
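The sequence Smith describes follows from how RAID 6 works: each stripe carries two parity blocks, so the array can lose at most two drives before data becomes unrecoverable. The short Python sketch below is only an illustration of that rule and of the replication figures quoted above. The state names ("healthy", "degraded", "critical", "failed") and the function `raid6_state` are labels made up for this example, not terms from Prime Hosting or any RAID controller.

```python
# Illustrative model only: RAID 6 keeps two parity blocks per stripe,
# so it survives at most two simultaneous drive failures.
RAID6_FAULT_TOLERANCE = 2


def raid6_state(failed_drives: int) -> str:
    """Return a hypothetical state label for a RAID 6 array."""
    if failed_drives < 0:
        raise ValueError("failed_drives must be non-negative")
    if failed_drives == 0:
        return "healthy"
    if failed_drives < RAID6_FAULT_TOLERANCE:
        return "degraded"   # one drive lost, one failure still tolerated
    if failed_drives == RAID6_FAULT_TOLERANCE:
        return "critical"   # no redundancy left
    return "failed"         # more failures than parity can cover


# The incident as reported: drives failed one after another.
for failed in range(4):
    print(f"{failed} failed drive(s): {raid6_state(failed)}")

# Replication progress quoted by Smith: 150 GB synced out of 790 GB.
synced_fraction = 150 / 790
print(f"Replication completed: {synced_fraction:.0%}")
```

By this reckoning the array had no redundancy left once the second drive failed, and the last-ditch replication had copied roughly 19 percent of the data when the third drive gave out.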

What is not known is the timeframe between the first disk failure and when action was initiated to rectify it.

This unfortunate incident starkly illustrates that RAID has always been about higher availability, not about replacing backups, and that regular offline backups remain a necessity. Fortunately, Prime Hosting did have some backups.

However, several customers were upset to find their restored websites had reverted to a backup that was months old.

Moreover, having three HDDs fail in the same array in such a short span of time also indicates possible problems with that particular batch of HDDs. Indeed, some storage experts have in the past called for HDDs to be purchased from different batches to prevent a situation where multiple drives decide to cease operation at the same time.