Data Recovery Scenario - Tech Support and Data Loss
Electronic data is part of all of our lives; some is business and some is pleasure. The loss of either can devastate you financially, emotionally or both. There are several ways that data can become inaccessible to you. Most of these failures are recoverable; some are not. First, let's look at recoverable failures. These can be physical or logical in nature and, if approached correctly, 100% of your data can be saved.
- Physical device failure of external or internal electronic components
- Logical corruption of the operating system files
- Logical corruption of the manufacturer's system area or firmware
- Logical corruption of file system components other than damage to the tables
Unrecoverable failures, by contrast, are those in which the stored data itself is destroyed:
- Physical damage to the actual media holding the binary data
- Overwritten data
Product support, in any industry, is essential and adds value to the product. Customer satisfaction ranks highly with manufacturers but sometimes turns into complete "dissatisfaction" with irreversible results. This is widespread in the computer industry due to bad suggestions given by those who should know best.
The most common mistake involves logically corrupted personal computers, where the support technician instructs the user to do a system restore with complete disregard for user files. This is done either by running the restore partition from the hard drive or by using a restore CD/DVD. When this routine is initiated, it brings the computer back to its original factory configuration, minus the user's data.
These routines will typically delete the primary partition, recreate and format it, and reinstall the operating system (OS) and any furnished applications. When this happens, the table (MFT, FAT, B+ tree, etc.) used to keep track of your personal files and their possible fragments is overwritten. The restore does not completely overwrite your personal files, and some can be recovered with third-party software if you know how to use it. These tools rely on file signatures, so they work for small, non-fragmented files. Without the table there is no record of where a file's fragments live, so larger files such as home videos and email files will typically be unrecoverable.
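To illustrate why signature-based recovery works for contiguous files but not fragmented ones, here is a minimal sketch. It assumes a JPEG-only signature table and a toy in-memory "disk image"; the function name `carve_jpegs` and the sample data are hypothetical, and real recovery tools handle many more formats and edge cases (embedded thumbnails, for example).

```python
# Illustrative only: JPEG start/end markers, used here as the "file signature".
JPEG_START = b"\xff\xd8\xff"   # start-of-image marker plus the next marker byte
JPEG_END = b"\xff\xd9"         # end-of-image marker


def carve_jpegs(raw: bytes) -> list:
    """Return every byte run from a JPEG header to the next end marker.

    Assumes each file is stored contiguously; no file system table is used.
    """
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_START, pos)
        if start == -1:
            break
        end = raw.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break
        found.append(raw[start:end + len(JPEG_END)])
        pos = end + len(JPEG_END)
    return found


# Toy "disk image": filler, one fake JPEG stored contiguously, more filler.
fake_jpeg = JPEG_START + b"\xe0" + b"picture-bytes" + JPEG_END
disk_image = b"\x00" * 16 + fake_jpeg + b"\x00" * 16
recovered = carve_jpegs(disk_image)

# The same file split into two fragments with another file's data in between.
half = len(fake_jpeg) // 2
fragmented_image = fake_jpeg[:half] + b"OTHER-FILE-DATA" + fake_jpeg[half:]
damaged = carve_jpegs(fragmented_image)

print(recovered[0] == fake_jpeg)  # contiguous file is carved intact
print(damaged[0] == fake_jpeg)    # fragmented file picks up foreign data
```

The carver has no way to know that the bytes between the two fragments belong to another file, which is the situation a large, fragmented video or mailbox is in once the table is gone.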
Business Critical Data
Businesses fail for various reasons, but a company should never fail due to lost data. It happens more often than you would think. Over the years, we have seen several businesses fail due to unrecoverable data. Statistics show that a company totally reliant on a SQL database will typically founder within 20 business days without access to the records.
How can this happen to a company? Let's look at the most common scenario. Corporate servers are configured with fault-tolerant storage known as RAID (Redundant Array of Independent Disks). The most common configuration is RAID level 5, which uses distributed parity. With this configuration, one drive can fall offline and its contents can be calculated on the fly from the remaining data and the distributed parity, so the user data is presented as if nothing is wrong. This is known as "critical state" (also called degraded mode).
Running critical, the server's performance will be degraded but it will continue to function. In most cases, this condition will be recognized, the suspect drive will be replaced and the RAID will rebuild by design. If a second drive falls offline while the array is running critical, there is no longer enough information to calculate the missing data; the RAID will collapse and all data on the array will be inaccessible.
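The parity arithmetic behind this behavior can be sketched in a few lines. The example below assumes a hypothetical four-member array and looks at a single stripe with XOR parity; the names `xor_blocks` and `read_stripe` are made up for illustration, and a real controller also rotates the parity block across members from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"ACCT", b"2024", b"DATA"]
parity = xor_blocks(data)
stripe = data + [parity]  # what the four members hold for this stripe


def read_stripe(stripe, offline):
    """Return the stripe with an offline member recomputed from the others.

    Raises ValueError when more than one member is missing, because single
    parity can only stand in for one absent block.
    """
    if len(offline) > 1:
        raise ValueError("two members offline: not enough information to reconstruct")
    rebuilt = list(stripe)
    for idx in offline:
        survivors = [blk for i, blk in enumerate(stripe) if i != idx]
        rebuilt[idx] = xor_blocks(survivors)
    return rebuilt


# Critical state: member 1 is offline, yet its block is recomputed on the fly.
degraded = read_stripe(stripe, offline={1})
print(degraded == stripe)

# A second member offline leaves nothing to compute from.
try:
    read_stripe(stripe, offline={0, 2})
except ValueError as err:
    print(err)
```

Because XOR is its own inverse, any single block in the stripe equals the XOR of all the others, which is why one missing member is tolerable and two are not.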
Post Failure
In the failure scenario above, we have two members that have fallen offline for some reason. This could be from a controller glitch, backplane failure or physical failure of the members themselves.
It is absolutely critical that the sequence of failure and the failure mechanisms are known before moving forward. This is where the trouble starts, as the administrator is typically functioning under duress and scrambling for a quick resolution. He or she doesn't know for sure which drive failed first or why, and obviously had no knowledge of the state of the array or it wouldn't be down at this point.
First move: power cycle the setup and let's see what happens! Bingo, we have one of the suspect drives back online, but the server can't see the volume. Time to call support. Now we're on the phone with a technician who has no more knowledge of the events than the administrator of the system. However, this technician does have knowledge of the equipment and possibly the original configuration.
The Fatal Mistake
The administrator informs the technician of the current state of the array: all drives are online except one. Not a problem, right? This is a RAID level 5 configuration and is fault tolerant, so the array just needs to be rebuilt with a new drive. Support overnights a replacement drive to the administrator, who receives it the following day and inserts it into the enclosure. The technician guides the administrator through the RAID controller routines and forces the rebuild of the array.
The volume now appears normal with the folder structure intact, but the vast majority of the files are corrupted. How could this be? We know that the members of the array are physically fine, so that's not the problem. The problem was caused by incorporating a drive member into the rebuild that contained stale raw data and parity information. Yes, the first drive that went offline and was sitting dormant came back online. It had missed every write made while the array was running critical, yet the rebuild treated its contents as current.
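A small sketch shows why a stale member poisons the rebuild. It assumes the same simplified model as before: a hypothetical four-member array, one stripe, XOR parity, and made-up block contents. Member 0 drops out first, the array keeps taking writes, and a rebuild is later forced onto a replacement for member 1 using member 0's frozen contents.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Stripe contents at the moment member 0 falls offline (hypothetical data).
old = [b"AAAA", b"BBBB", b"CCCC"]
stale_member0 = old[0]  # frozen from here on; it sees no further writes

# The array runs critical and the stripe is rewritten. Member 0's current
# block exists only implicitly, through the parity held on member 3.
new = [b"WXYZ", b"QRST", b"MNOP"]
member1, member2 = new[1], new[2]
member3_parity = xor_blocks(new)

# Forced rebuild: member 1 is regenerated from stale member 0 plus the
# up-to-date members 2 and 3. The result is written to the replacement drive.
rebuilt_member1 = xor_blocks([stale_member0, member2, member3_parity])
print(rebuilt_member1 == new[1])  # the rebuilt block is not what was stored

# Excluding the stale member and using the three current members instead
# yields member 0's true current contents.
current_member0 = xor_blocks([member1, member2, member3_parity])
print(current_member0 == new[0])
```

The controller cannot tell stale blocks from current ones by parity alone; which member to exclude depends entirely on knowing the sequence of failure.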
We see this scenario in approximately 30% of the RAID 5 configurations submitted for recovery. The rebuild process is irreversible and the data was destroyed by overwrite; there is nothing that can be done at this point. Technical support holds no liability for this devastating loss and will fall back to contractual agreements stating so.
Think First!
Had the array in the above situation been placed in the hands of Data Recovery Services before the rebuild, there was a 98% chance of getting 100% of the data back: viable data with folder and file structure intact. With 31 years' tenure in the data recovery industry, we have seen hundreds of projects with data that is lost forever due to overwrite. Here are some things that can help you avoid this type of loss:
- Maintain a verified backup of critical data
- Keep a close eye on those "taken-for-granted" fault-tolerant systems
- Have 100% knowledge of the sequence of failure before trying anything
- Stress to technical support the value of the data to curb any overzealous procedures
- Most importantly, STOP where you are if you're uncomfortable or unsure