Thursday, September 3, 2009

Disaster Recovery Testing - Epic Fail #1

As I've mentioned before, my big project for this month is disaster recovery testing. A few things have changed since our last comprehensive test of our backup practices, and we are long overdue. Because of this, I expect many "failures" along the way that will need to be remedied. I expect our network documentation to be lacking, and I expect to be missing current versions of software in our disaster kit. I know for a fact that we don't have detailed recovery instructions for several new enterprise systems. This is why we test: to find and fix these shortcomings.

This week, at the beginning stages of the testing, we encountered our first "failure". We've dubbed it "Epic Failure #1" and it's all about those backup tapes.

A while back our outside auditor wanted us to password protect our tapes. We were running Symantec Backup Exec 10d at the time and were happy to comply. The password was promptly documented with our other important passwords. Our backup administrator successfully tested restores. Smiles all around.

We faithfully run backups daily. We run assorted restores every month to recover lost Word documents, to quickly migrate large file structures between servers, and to correct data corruption issues. We've had good luck with the integrity of our tapes. More smiles.

Earlier this week, I loaded the first tape I needed to restore in my DR lab. I typed the password to catalog the tape and it told me I had it wrong. I typed it again, because it's not an easy password and perhaps I had made a mistake. The error message appeared; my smile did not.

After poking around in the Backup Exec databases on production and comparing existing XML catalog files against those from a tape known to work with the password, we concluded that our regular daily backup jobs simply have a different password. Or at least the password hash is completely different, and that difference is repeated across the password-protected backup jobs on all our production backup media servers. Frown.

After testing a series of tapes from different points in time and from different servers, we came to the following disturbing conclusion: the migration of our Backup Exec software from 10d to 12.5, which also required us to install version 11 as part of the upgrade path, mangled the password hashes on the pre-existing job settings. Or perhaps the newer version uses a different algorithm, or something similar with the same result.

Any tapes with backup jobs that came from the 10d version of the software use the known password without issue. And any new jobs that are created without a password (since 12.5 doesn't support media passwords anymore) are also fine. Tapes that have the "mystery password" on them are only readable by a media server that has the tape cataloged already, in this case the server that created it. So while they are useless in a full disaster scenario, they work for any current restorations we need in production. We upgraded Backup Exec just a few months ago, so the overall damage is limited to a specific time frame.

Correcting this issue required our backup administrator to create new jobs without password protection. Backup Exec 12.5 doesn't support that type of media protection anymore (it was removed in version 11) so there is no obvious way to remove the password from the original job. Once we have some fresh, reliable backups off-site I can continue with the disaster testing. We'll also have to look into testing the new tape encryption features in the current version of Backup Exec and see if we can use those to meet our audit requirements.

The lesson learned here was that even though the backup tapes were tested after the software upgrade, they should have been tested on a completely different media server. While our "routine" restore tasks showed our tapes had good data, they didn't prove the tapes would still save us in a severe disaster scenario.
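
One way to keep that lesson from slipping again is to track where each test restore was performed and flag any tape that has never been restored successfully on a server other than the one that wrote it. The sketch below is only an illustration of that bookkeeping, not anything Backup Exec provides: the RestoreTest record, its field names, and the server and tape names are all made up for the example, and the records would have to be filled in by hand or from your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreTest:
    """One test restore. All field names are hypothetical, for illustration."""
    tape_id: str
    created_on: str    # media server that wrote the tape
    restored_on: str   # media server that performed the test restore
    succeeded: bool


def tapes_lacking_independent_restore(tests):
    """Return tape IDs with no successful restore on a *different* server.

    A tape that has only ever been restored on the server that created it
    may be relying on that server's existing catalog, so it is not yet
    proven for a full disaster scenario.
    """
    seen = set()
    proven = set()
    for test in tests:
        seen.add(test.tape_id)
        if test.succeeded and test.restored_on != test.created_on:
            proven.add(test.tape_id)
    return sorted(seen - proven)


if __name__ == "__main__":
    # Made-up sample records: T1 was only restored at home, T2 was proven
    # in the DR lab, and T3 failed when tried in the DR lab.
    history = [
        RestoreTest("T1", "prod1", "prod1", True),
        RestoreTest("T2", "prod1", "drlab", True),
        RestoreTest("T3", "prod2", "drlab", False),
    ]
    print(tapes_lacking_independent_restore(history))  # ['T1', 'T3']
```

Anything this check reports is a tape that routine production restores may be vouching for without ever having been read cold, which is exactly the gap we fell into.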
