Monday, August 31, 2015
Business Continuity and the Cloud
Wednesday, August 19, 2015
Summer Reads!
- (Read) Containers: Docker, Windows and Trends
- (Do) 18 Steps for End-to-End IaaS Provisioning in the Cloud with Azure Resource Manager (ARM), PowerShell and Desired State Configuration (DSC)
- (Watch) Best practices for DR as a service
- (Read) What's New in Windows Server 2016 Technical Preview 3
- Raw Tech - Lots of Dev stuff, but a pretty interesting assortment.
- Azure Documentation Shorts - Quick videos covering the how-to of what's documented for Azure.
Monday, August 10, 2015
TechNet on Tour - Disaster Recovery!
- 9/1 - Seattle, WA
- 9/3 - San Francisco, CA
- 9/22 - Houston, TX
- 9/29 - Charlotte, NC
- 9/30 - Malvern, PA
- 10/6 - Indianapolis, IN
- 10/7 - Tampa, FL
- 10/8 - New York, NY
- 10/14 - Irvine, CA
- 10/16 - Dallas, TX
Tuesday, February 25, 2014
I’ve Got Nothing: The DR Checklist
Disclaimer: I love technology, I think that cloud computing and virtualization are paramount to increasing the speed you can get your data and services back online. But when disaster strikes, you can bet I’m reaching for something on paper to lead the way. You do not want your recovery plans to hinge on finding the power cable for that dusty laptop that is acting as the offline repository for your documentation. It’s old school, but it works. If you have a better suggestion than multiple copies of printed documentation, please let me know. Until then, finding a ring binder is my Item #0 on the list. (Okay, Hyper-V Recovery Manager is a pretty cool replacement for paper if you have two locations, but I'd probably still have something printed to check off...)
The Checklist
- Backups - I always start at the backups. When your data center is reduced to a pile of rubble the only thing you may have to start with is your backups, everything else supports turning those backups into usable services again. Document out your backup schedule, what servers and data are backed up to what tapes or sets, how often those backups are tested and rotated. Take note if you are backing up whole servers as VMs, or just the data, or both. (If you haven’t yet, read Brian’s post on the value of virtual machines when it comes to disaster recovery.)
- Facilities - Where are you and your backups going to come together to work this recovery magic? Your CEO’s garage? A secondary location that’s been predetermined? The Cloud? List out anything you know about facilities. If you have a hot site or cold site, include the address, phone numbers and access information. (Look at Keith’s blog about using Azure for a recovery location.)
- People - Your DR plan should include a list of people who are part of the recovery process. First and foremost, note who has the right to declare a disaster in the first place. You need to know who can and can’t kick off a process that will start with having an entire set of backups delivered to an alternate location. Also include the contact information for the people you need to successfully complete a recovery - key IT, facilities and department heads might be needed. Don’t forget to include their backup person.
- Support Services - Do you need to order equipment? Will you need support from a vendor? Include names and numbers of all these services and if possible, include alternatives outside of your immediate area. Your local vendor might not be available if the disaster is widespread like an earthquake or weather incident.
- Employee Notification System - How do you plan on sharing information with employees about the status of the company and what services will be available to use? Your company might already have something in place - maybe a phone hotline or externally hosted emergency website. Make sure you are aware of it and know how you can get updates made to the information.
- Diagrams, Configurations and Summaries - Include copies of any diagrams you have for networking and other interconnected systems. You'll be glad you have them for reference even if you don't build your recovery network the same way.
- Hardware - Do you have appropriate hardware to recover to? Do you have the networking gear, cables and power to connect everything together and keep it running? You should list out the specifications of the hardware you are using now and what the minimum acceptable replacements would be. Include contact information for where to order hardware from and details about how to pay for equipment. Depending on the type of disaster you are recovering from, your hardware vendor might not be keen on accepting a purchase order or billing you later. If you are looking at Azure as a recovery location, make sure to note what size of compute power would match up. (A quick PowerShell sketch for gathering these specs follows this list.)
- Step-By-Step Guides - If you’ve started testing your system restores, you should have some guides formed. If your plans include building servers from the ground up, your guides should include references to the software versions and licensing keys required. When you are running your practice restores, anything that makes you step away from the guide should be noted. In my last disaster recovery book, I broke out the binder in sections, in order of recovery with the step-by-steps and supporting information in each area. (Extra credit if you have PowerShell ready to automate parts of this.)
- Software - If a step in your process includes loading software, it needs to be available on physical media. You do not want to have to rely on having a working, high-speed Internet connection to download gigs of software.
- Clients - Finally, don’t forget your end users. Your plan should include details about how they will be connecting, what equipment they would be expected to use if the office is not available and how you will initially communicate with them. Part of your testing should include having a pilot group of users attempt to access your test DR setup so you can improve the instructions they will be provided. Chances are, you’ll be too busy to make individual house calls. (For more, check out Matt’s post on using VDI as a way to protect client data.)
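If you want a head start on gathering those hardware specs (and some fodder for the step-by-step guides), here's a minimal PowerShell sketch that pulls basic details into a CSV you can print for the binder. The server names are placeholders - point it at your own list or feed it names from Active Directory.

```powershell
# Minimal sketch: collect basic hardware specs for the DR binder.
# The server names below are placeholders - substitute your own.
$servers = 'SERVER01', 'SERVER02'

$inventory = foreach ($server in $servers) {
    $cs = Get-CimInstance -ClassName Win32_ComputerSystem -ComputerName $server
    $os = Get-CimInstance -ClassName Win32_OperatingSystem -ComputerName $server
    [PSCustomObject]@{
        Name     = $server
        Model    = $cs.Model
        CPUs     = $cs.NumberOfProcessors
        MemoryGB = [math]::Round($cs.TotalPhysicalMemory / 1GB, 1)
        OS       = $os.Caption
        LastBoot = $os.LastBootUpTime
    }
}

# Drop the results next to the rest of your DR documentation.
$inventory | Export-Csv -Path .\DR-HardwareInventory.csv -NoTypeInformation
```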
No matter the result of your testing, it will be better than the last time. Go forth and be prepared.
Oh, one more thing, if you live in a geographic area where weather or other "earthly" disasters are probable, please take some time to do some DR planning for your home as well. I don't care who you work for, if your home and family aren't secure after a disaster you certainly won't be effective at work. Visit www.ready.gov or www.redcross.org/prepare/disaster-safety-library for more information.
This post is part of a 15-part series on Disaster Recovery and Business Continuity planning by the US-based Microsoft IT Evangelists. For the full list of articles in this series, see the intro post located here: http://mythoughtsonit.com/2014/02/intro-to-series-disaster-recovery-planning-for-i-t-pros/
Tuesday, February 18, 2014
Question: Is there value in testing your Disaster Recovery Plan?
There are a few reasons you need to regularly test your recovery plans… I’ve got my top three.
- Backups only work if they are good.
- Your documentation is only useful if you can follow it.
- You are soft and easily crushed.
Everyone knows the mantra of “backup, backup, backup” but you also have to test those backups for accuracy and functionality. I’m not going to beat this one endlessly, but please read an old post of mine - “Epic Fail #1” to see how backups can fail in spectacular, unplanned ways.
Documentation
Simply put, you need good documentation. You need easy-to-locate lists of vendors, support numbers, configuration details of machines and applications, notes on how "this" interacts with "that", which services have dependencies on others and step-by-step instructions for processes you don't do often and even those you DO do every day.
When under pressure to troubleshoot an issue that is causing downtime, it's likely you'll lose track of where to find the information you need to successfully recover. Having clean documentation will keep you calm and focused at a time you really need to have your head in the game.
Realistically, your documentation will be out of date when you use it. You won’t mean for it to be, but even if you have a great DR plan in place, I’ll bet you upgraded a system, changed vendors, or altered a process almost immediately after your update cycle. Regular review of your documents is a valuable part of testing, even if you don’t touch your lab.
My personal method is to keep a binder with hard copies of all my DR documentation handy. Whenever I change a system, I make a note on the hard copy. Quarterly, I update the electronic version and reprint it. With the binder, I always have information available in case the electronic version is not accessible, and the copy with the handwritten margin notes is often more up to date than the electronic version. Even something declaring a section "THIS IS ENTIRELY WRONG NOW" can save someone hours of heading down the wrong path.
You
No one wants to contemplate their mortality, I completely understand. (Or maybe you just want to go on vacation without getting a call halfway through. Shocker, right?) But if you happen to hold the only knowledge of how something works in your data center, then you are a walking liability for your company. You aren't securing your job by being the only person with the password to the schema admin account, for example. It only takes one run-in with a cross-town bus to create a business continuity issue for your company that didn't even touch the data center.
This extends to your documentation. Those step-by-step instructions for recovery need to include information and tips that someone else on your team (or an outside consultant) can follow without having prior intimate knowledge of that system. Sometimes the first step is “Call Support, the number is 800-555-1212” and that’s okay.
The only way to find out what others don’t know is to test. Test with tabletop exercises, test with those backup tapes and test with that documentation. Pick a server or application and have someone who knows it best write the first draft and then hand it to someone else to try to follow. Fill in the blanks. Repeat. Repeat again.
A lot of this process requires only your time. Time you certainly won't have when your CEO is breathing down your neck about recovering his email.
Additional Resources
This post is part of a 15-part series on Disaster Recovery and Business Continuity planning by the US-based Microsoft IT Evangelists. For the full list of articles in this series, see the intro post located here: http://mythoughtsonit.com/2014/02/intro-to-series-disaster-recovery-planning-for-i-t-pros/
If you are ready to take things further, check out Automated Disaster Recovery Testing with Hyper-V Replica and PowerShell - http://blogs.technet.com/b/keithmayer/archive/2012/10/05/automate-disaster-recovery-plan-with-windows-server-2012-hyper-v-replica-and-powershell-3-0.aspx
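For a taste of what that automation looks like, here's a minimal sketch of a scripted test failover with Hyper-V Replica. The host and VM names are placeholders, and it assumes replication is already configured and the Hyper-V PowerShell module is available.

```powershell
# Minimal sketch of a test failover with Hyper-V Replica.
# 'DR-HOST' and 'FileServer01' are placeholder names for your replica host and VM.
$replicaHost = 'DR-HOST'
$vmName      = 'FileServer01'

# Check replication health before testing.
Get-VMReplication -ComputerName $replicaHost -VMName $vmName

# Start a test failover - this spins up an isolated copy of the VM
# without disturbing the production replication relationship.
Start-VMFailover -ComputerName $replicaHost -VMName $vmName -AsTest

# ...validate the test VM on an isolated network, then clean up:
Stop-VMFailover -ComputerName $replicaHost -VMName $vmName
```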
Tuesday, February 11, 2014
Disaster Recovery for IT Pros: How to Plan, What are the Considerations?
I've always thought that being an IT Pro is one of the most powerful, powerless jobs in existence. We have our fingers on the pulse of what makes our businesses run, we have access to ALL THE DATA and we have the power to control access and availability to the resources. But we are often slaves to the business - we are responsible for providing the best up times, the best solutions and the best support we can. Facing budgets we can't always control while trying to explain technology to people who don't have time to understand it.
So where do you begin when tasked with updating or creating your disaster recovery plan? The good news is you don't need money or lots of extra hardware to start good disaster recovery planning - grab the note-taking tools of your choice and start asking questions.
Here are my three main questions to get started:
- What are the most important applications or services in each business unit or for the business overall?
- How much downtime is acceptable (your recovery time objective)?
- How much data loss is acceptable (your recovery point objective)?
This post is one of many in a disaster recovery series being penned by the IT Pro Evangelists at Microsoft. As the series progresses, you'll find the complete list on Brian Lewis's blog post, "Blog Series: DR Planning for IT Pros." We will cover tools and applications you can consider in your planning and get you started with using them. They have various costs, but until you know your goal, you won't know what tools will help and can't argue the budget.
So let's put the pencil to the paper and start answering those three questions.
Start at the top: Go to upper management and have your CTO or CIO pull together a leadership meeting to rank what systems the business units use and what they think is needed first. Get them to look at the business overall and determine how much downtime is too much, how quickly they want services recovered and how much data they are willing to lose.
When it comes to determining your internal SLA you do need to know what scenario you are planning for. Preparing for a riot that blocks access to your office is different than an earthquake that renders your data center a steaming pile of rubble. Ultimately, you want different plans for different scenarios, but if you must start somewhere, go with the worst case so you can cover all your bases.
But what if you can't get leadership to sit down for this, or they want you to come to the table first with a draft plan? Just GUESS.
Seriously, you have your hand on the data center, you know the primary goals of your business. If it was your company, what do you think you need to recover first? Use your gut to get you started. Look at your data center and pick out some of the key services that likely need to be recovered first to support the business needs. Domain controllers, encryption key management systems, infrastructure services like DNS and DHCP, communication tools and connectivity to the Internet might float to the top.
Sort the List: People want email right away? Great, that also needs an Internet connection and access to your authentication system, like Active Directory. People want the document management system or CRM or some in-house app with a database back-end? Fabulous, you need your SQL Servers and maybe the web front-end or the server that supports the client application.
Gather Your Tools: Look at your list of loosely ranked servers, devices and appliances and start building a shopping list of things you need to even start recovery. I always start with the "steaming pile of rubble" scenario, so my list starts like this (a quick PowerShell starter for the inventory items follows the list):
- Contact information for hardware and software vendors
- Contact information and locations where my data center can function temporarily
- List of existing hardware and specifications that would need to be met or exceeded if ordering new equipment for recovery
- List of operating systems and other software, with version details and support documentation
- Names of the people in the company that would be crucial to the successful recovery of the data center
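As a starting point for the hardware and software lists, here's a minimal sketch that pulls a rough server inventory out of Active Directory. It assumes the ActiveDirectory RSAT module is installed and that you have rights to query the domain.

```powershell
# Minimal sketch: pull a rough server and OS inventory out of Active Directory
# to seed the lists above. Requires the ActiveDirectory RSAT module.
Import-Module ActiveDirectory

Get-ADComputer -Filter 'OperatingSystem -like "*Server*"' `
    -Properties OperatingSystem, OperatingSystemVersion, Description |
    Select-Object Name, OperatingSystem, OperatingSystemVersion, Description |
    Sort-Object Name |
    Export-Csv -Path .\ServerInventory.csv -NoTypeInformation
```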
Congratulations! You are closer to a usable DR plan than you were before you started and we've just scratched the surface. Disaster Recovery planning is often pushed off until tomorrow. Whatever you have today, be it an outdated document from your company leadership, server documentation that is a year old, or NOTHING, you can take time each day to improve it. How you plan is going to depend on the needs of your organization and you won't be able to complete the process in a silo, but you can get started.
I really enjoy disaster recovery planning. It's challenging, it's ever changing and I haven't even mentioned how things like virtualization, Hyper-V Replica and Azure can be some of the tools you use. Stay tuned for more in the series about how some of those things can come into play. Sometimes the hardest part about disaster recovery planning is just getting started.
***
Wednesday, May 30, 2012
End of the Month Round Up
I won't be speaking this year, but that just gives me more time to attend some of the great sessions - I'll be concentrating on Active Directory in Server 2012, Exchange 2010, PowerShell and some System Center.
If you are hoping for something more local to your home town, check out the Windows Server 2012 Community Roadshow. US locations will include Houston, Chicago, Irvine, New York and San Jose, just to name a few. Microsoft MVPs will be presenting the content, so don't miss out on a free chance to prepare for the release of Server 2012.
Another notable event that's upcoming is the World IPv6 Launch. Check out which major ISPs and web companies are turning on IPv6 for the duration.
Finally, if you are looking to make some improvements to your personal, cloud-based storage and file management for your personal computers, take a look at SugarSync. I've been using it for several years and it's been an easy way for me to access files from multiple computers and keep everything synced and backed up. I've even got a link for a referral if you'd like to try it out.
Thursday, January 26, 2012
Recovering Exchange 2010 - Notes from the Field
Check out this TechNet article with the basics for recovering Exchange 2010. However, there are some little tips that would be helpful, especially when you might be working under a stressful situation to restore your mail system.
- Make sure you know where your install directory is if Exchange isn't installed in the default location. If you don't have it written down as part of your disaster recovery documentation, you can get that information out of Active Directory using ADSIEDIT (a PowerShell alternative is sketched after this list).
- Make sure you know the additional syntax for "setup /m:RecoverServer" switch. If you need to change the target directory the proper syntax is /t:"D:\Microsoft\Exchange\V14" or whatever your custom path happens to be.
- If you are planning on using the /InstallWindowsComponents switch to save some time with getting your IIS settings just right, make sure you've preinstalled the .NET Framework 3.5.1 feature set first.
- Don't forget to preinstall the Office 2010 Filter Packs. You don't need them to complete the setup, but you will be reminded about them as a requirement.
- Make sure you install your remote agent (or whatever components are necessary) for your backup software. Once the Exchange installation is restored, you'll need to mark your databases as "This database can be overwritten by a restore" so that you can restore the user data.
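As promised above, here's a hedged PowerShell sketch covering two of those steps: reading the install path out of Active Directory (assuming the msExchInstallPath attribute on the server object, queried with the ActiveDirectory module) and allowing the databases to be overwritten once setup finishes (run in the Exchange Management Shell). The server name is a placeholder.

```powershell
# Minimal sketch: read the Exchange install path from Active Directory
# instead of browsing with ADSIEDIT. Requires the ActiveDirectory module.
Import-Module ActiveDirectory
$configNC = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase $configNC -LDAPFilter '(objectClass=msExchExchangeServer)' `
    -Properties msExchInstallPath |
    Select-Object Name, msExchInstallPath

# After the recovery install completes, allow the databases to be overwritten
# so the backup software can lay the user data back down.
# Run this part in the Exchange Management Shell; 'EXCH01' is a placeholder.
Get-MailboxDatabase -Server 'EXCH01' | Set-MailboxDatabase -AllowFileRestore $true
```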
Thursday, October 20, 2011
Playing IT Fast and Loose
It also helps that our business model doesn't require selling things to the public or answering to many external "customers", which puts us in the interesting position where it's almost okay if we are down for a day or two, as long as we can get things back to pretty close to where they were before they went down. That also sets us up to make some very interesting decisions come budget time. They aren't necessarily "wrong", but they can end up being awkward at times.
For example, we've been working over the last two years to virtualize our infrastructure. This makes lots of sense for us - our office space requirements are shrinking and our servers aren't heavily utilized individually, yet we tend to need lots of individual servers due to our line of business. When our virtualization project finally got rolling, we opted to use a small array of SAN devices from Lefthand (now HP). We've always used Compaq/HP equipment and have been very happy with the dependability of the physical hardware. Hard drives are considered consumables and we do expect failures of those from time to time, but whole systems really biting the dust? Not so much.
Because of all the factors I've mentioned, we made the decision NOT to mirror our SAN array, or do any network RAID. (That's right, you can pause for a moment while the IT gods strike me down.) We opted for using all the space we could for data and weighed that against the odds of a failure that would destroy the data on a SAN, rendering the entire RAID 0 array useless.
Early this week, we came really close. We had a motherboard fail on one of the SANs, taking down our entire VM infrastructure. This included everything except the VoIP phone system and two major applications that have not yet been virtualized. We were down for about 18 hours total, which included one business day.
Granted, we spent the majority of our downtime waiting for parts from HP and planning for the ultimate worst - restoring everything from backup. While we may think highly of HP hardware overall, we don't think very highly of their 4-hour response windows on Sunday nights. Ultimately, over 99% of the data on the SAN survived the hardware failure and the VMs popped back into action as soon as the SAN came back online. We only had to restore one non-production server from backup after the motherboard replacement.
Today, our upper management complimented us on how we handled the issue and was pleased with how quickly we got everything working again.
Do I recommend not having redundancy on your critical systems? Nope.
But if your company management fully understands and agrees to the risks related to certain budgeting decisions, then as an IT Pro your job is simply to do the best you can with what you have and clearly define the potential results of certain failure scenarios.
Still, I'm thinking it might be a good time to hit Vegas, because Lady Luck was certainly on our side.
Wednesday, March 30, 2011
Tomorrow is World Backup Day
If you are a systems admin, you probably already have a backup solution in place at the office or for your clients. Take some time tomorrow to check in on those processes to make sure you aren't missing something important and that they are working the way you expect.
At home, check on or implement a solution for your important files and photos on your home computers. It can be as simple as purchasing a portable drive or using a cloud based solution. I'm a SugarSync fan myself. If you want to check out SugarSync for yourself, use this referral code and get some bonus free space.
With the proper backup solution in place, your home laptop can be almost instantly replaceable with no worries. I recently reinstalled the OS on my netbook and was able to sync all my data files right back on with SugarSync. It's easy and helps me sleep better at night!
Learn more about World Backup Day at http://www.worldbackupday.net/
Monday, November 29, 2010
The How and Why of an ImageRight Test Environment
ImageRight has an interesting back-end architecture. While it's highly dependent on Active Directory for authentication (if you use the integrated log on method), the information about which other servers the application server and the client software should interact with is completely controlled by database entries and XML setup files. Because of this you can have different ImageRight application servers, databases and image stores all on the same network with no conflicts or sharing of information. Yet you don't need to provide a separate Active Directory infrastructure or network subnet.
While our ultimate goal was to provide a test/dev platform for our workflow designer, we also used this exercise as an opportunity to run a "mini" disaster recovery test so I could update our recovery documentation related to this system.
To set up a test environment, you'll need at least one server to hold all your ImageRight bits and pieces - the application server service, the database and the images themselves. For testing, we don't have enough storage available to restore our complete set of images, so we only copied a subset. Our database was a complete restoration, so test users will see a message about the system being unable to locate documents that weren't copied over.
I recommend referring to both the "ImageRight Version 5 Installation Guide" and the "Create a Test Environment" documents available on the Vertafore website for ImageRight clients. The installation guide will give you all the prerequisites needed to run ImageRight and the document on test environments has details of which XML files need to be edited to ensure that your test server is properly isolated from your production environment. Once you've restored your database, image stores and install share (aka "Imagewrt$"), it's quick and easy to tweak the XML files and get ImageRight up and running.
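Purely as an illustration of the pattern - the real file names and element names come from Vertafore's "Create a Test Environment" document, so everything below is a hypothetical placeholder - tweaking an XML config file with PowerShell looks something like this:

```powershell
# Hypothetical example only - substitute the file and elements the Vertafore
# guide actually calls out for your version.
$configPath = 'D:\ImageRight\Imagewrt$\ApplicationServer.config.xml'   # placeholder path

[xml]$config = Get-Content -Path $configPath

# Placeholder node names: point the restored config at the isolated test server.
$config.Configuration.ApplicationServer.HostName = 'IR-TEST01'

$config.Save($configPath)
```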
For our disaster recovery preparations, I updated our overall information about ImageRight, our step-by-step guide for recovery and burned a copy of our install share to a DVD so it can be included in our off-site DR kit. While you can download a copy of the official ImageRight ISO, I prefer to keep a copy of our expanded "Imagewrt$" share instead - especially since we've added hotfixes to the version we are running, which could differ from the current ISO available online from Vertafore.
Because setting up the test environment was so easy, I could also see a use where some companies may want to use alternate ImageRight environments for extra sensitive documents, like payroll or HR. I can't speak to the additional licensing costs of having a second ImageRight setup specifically for production, but it's certainly technically possible if using different permissions on drawers and documents doesn't meet the business requirements for some departments.
Monday, December 7, 2009
If You Build It, Can They Come?
One end-user access issue involved the Terminal Services ActiveX control on Windows XP SP3, which is disabled by default as part of a security update in SP3. This can usually be fixed with a registry change, which I posted about before; however, that requires local administrative privileges that not all our testing users had. There are also ActiveX version issues if the client machine is running an XP service pack earlier than SP3.
Administrative privileges also caused some hiccups with one of our published web apps that required a Java plug-in. At one point, the web page required a Java update that could only be installed by a server administrator and this caused logon errors for all the users until that was addressed.
In this lab setting, we had also restored our file server to a different OS. Our production file server is Windows 2000 and in the lab we used Windows 2008. This resulted in some access permission issues for some shared and "home" directories. We didn't spend any time troubleshooting the problem this time around, but when we do look to upgrade that server or repeat this disaster recovery test we know to look into the permissions more closely.
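When we do revisit those permissions, a minimal sketch like this can dump the ACLs on the restored shares into a CSV so they can be compared against production. The share root path is a placeholder.

```powershell
# Minimal sketch: export the owner and access entries for each restored share
# folder so they can be diffed against the production server.
$shareRoot = 'E:\Shares'   # placeholder path

Get-ChildItem -Path $shareRoot | Where-Object { $_.PSIsContainer } |
    ForEach-Object {
        $acl = Get-Acl -Path $_.FullName
        New-Object PSObject -Property @{
            Folder = $_.FullName
            Owner  = $acl.Owner
            Access = ($acl.Access | ForEach-Object {
                          "$($_.IdentityReference):$($_.FileSystemRights)"
                      }) -join '; '
        }
    } | Export-Csv -Path .\RestoredShareAcls.csv -NoTypeInformation
```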
Users also experienced trouble getting Outlook 2007 to run properly. I did not have issues when I tested my own account - there were some dialog boxes that needed to be addressed before it ran for the first time, to confirm the username and such. While the answers to those boxes seem second nature to those of us in IT, we realized we will need to provide better documentation to ensure that users get email working right the first time.
In the end, detailed documentation proved to be the most important aspect of rolling this test environment out to end users. In the event of a disaster, it's likely that our primary way of sharing initial access information would be by posting instructions to the Internet. Providing easy-to-follow instructions with step-by-step screenshots that users can work through independently is critical. After a disaster, I don't expect my department will have a lot of time for individual hand-holding for each user that will be using remote access.
Not only did this project provide an opportunity to update our procedures used to restore services, it showed that it's equally important to make sure that end users have instructions so they can independently access those services once they are available.
Wednesday, October 14, 2009
Document Imaging Helps Organize IT
This reduces version control issues and ensures that a common naming (or "filing") structure is used across the board, making information easier to find. (For reference, an ImageRight "file" is a collection of documents organized together like a physical file that hangs in a file cabinet.) Plus, the ability to export individual documents or whole ImageRight "files" to a CD with an included viewer application is a great feature that I'm using as part of our Disaster Recovery preparations.
I have a single file that encompasses the contents of our network "runbook". This file contains server lists and configuration details, IP and DNS information, network maps, application and service dependencies, storage share locations/sizes, support contact information, etc. It consists of text documents, spreadsheets, PDF files and other types of data. I keep a hard copy printed at my desk so I can jot notes when changes are needed, but ImageRight ensures I have an electronic backup that I can edit on a regular basis. Plus, I regularly export an updated copy to a CD that I add to the off-site Disaster Recovery box.
The value of ImageRight in a disaster scenario expands beyond just our configuration documents. In an office where we deal with large amounts of paper, encouraging people to make sure those documents are added to ImageRight in a timely manner will ensure faster access to work products after an event that prevents access to the office or destroys the paper originals.
Monday, September 21, 2009
Restoring ImageRight in the DR Scenario
The database is SQL 2005 and at this point it wasn't the first SQL restoration in this project, so that went relatively smoothly. We had some trouble restoring the "model" and "msdb" system databases, but our DBA decided those weren't critical to ImageRight and to let the versions from the clean installation stay.
Once the database was restored, I turned to the application server. A directory known as the "Imagewrt$" share is required as it holds all the installation and configuration files. We don't have all the same servers available in the lab, so we had to adjust the main configuration file to reflect the new location of this important share. After that, the application installation had several small hurdles that required a little experimentation and research to overcome.
First, the SQL Browser service is required to generate the connection string from the application server to the database. This service isn't automatically started in the standard SQL installation. Second, the ImageRight Application Service won't start until it can authenticate its DLL certificates against the http://crl.verisign.net URL. Our lab setup doesn't have an Internet connection at the moment so this required another small workaround - temporarily changing the IE settings for the service account to not require checking the publisher's certificate.
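For the first hurdle, a couple of lines of PowerShell (assuming the default SQLBrowser service name) make sure the browser service comes up with the lab server; the CRL workaround above remains a manual tweak to the service account's IE settings.

```powershell
# Minimal sketch: the standard SQL installation leaves the SQL Browser service
# stopped, so set it to start automatically and bring it up now.
Set-Service -Name 'SQLBrowser' -StartupType Automatic
Start-Service -Name 'SQLBrowser'
Get-Service -Name 'SQLBrowser'   # confirm it's running
```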
Once the application service was running, I installed the desktop client software on the machine that will provide remote desktop access to the application. That installed without any issue and the basic functions of searching for and opening image files were tested successfully. We don't have the disk space available in the lab to restore ALL the images and data, so any images older than when we upgraded to version 4.0 aren't available for viewing. We'll have to take note of the growth on a regular basis so that in the event of a real disaster we have a realistic idea of how much disk space is required. This isn't the first time I've run short during this test, so I'm learning my current estimates aren't accurate enough.
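To keep those estimates honest, a quick sketch like this (the image store path is a placeholder) can be run on a schedule to record how big the store has grown:

```powershell
# Minimal sketch: total up the size of the image store so growth can be
# tracked over time. The UNC path is a placeholder.
$imageStore = '\\IRSERVER\Images'

$bytes = (Get-ChildItem -Path $imageStore -Recurse -Force -ErrorAction SilentlyContinue |
          Where-Object { -not $_.PSIsContainer } |
          Measure-Object -Property Length -Sum).Sum

"{0} is using {1:N1} GB as of {2}" -f $imageStore, ($bytes / 1GB), (Get-Date).ToShortDateString()
```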
Of course, it hasn't been fully tested and there are some components I know we are using in production that might or might not be restored initially after a disaster. I'm sure I'll get a better idea of what else might be needed after we have some staff from other departments connect and do more realistic testing. Overall, I'm pretty impressed with how easy it was to get the basic functionality restored without having to call ImageRight tech support.
Thursday, September 17, 2009
Paper vs. Electronic - The Data Double Standard
The former is the information on our Exchange server, SQL servers, financial systems, file shares and the like. The latter is the boxes and drawers of printed pages - some of which originally started out on one of those servers (or a server that existed in the past) and some of which did not. In the event of a serious disaster it would be impossible to recreate those paper files. Even if the majority of the documents could be located and reprinted, any single group of employees would be unable to remember everything that existed in a single file, never mind hundreds of boxes or file cabinets. In the case of our office, many of those boxes contain data that dates back decades, including handwritten forms and letters.
Like any good company, we have a high level plan that dictates what information systems are critical and the amount of data loss that will be tolerated in the event of an incident. This document makes it clear that our senior management understands the importance of what the servers in the data center contain. Ultimately, this drives our IT department's regular data backup policies and procedures.
However, IT is the only department required by this plan to ensure the recovery of the data we are custodians of. What extent of data loss is acceptable for the paper data owned by every other department after a fire or earthquake? A year of documents lost? 5 years? 10 years? No one has been held accountable for answering that question, yet most of those same departments won't accept more than a day's loss of email.
Granted, a lot of our paper documents are stored off site and only returned to the office when needed, but there are plenty of exceptions. Some staffers don't trust off site storage and keep their "most important" papers close by. Others in the office will tell you that the five boxes next to their cube aren't important enough to scan, yet are referenced so often they can't possibly be returned to storage.
And therein lies the battle we wage daily as custodians of the imaging system: simply getting everyone to understand the value of scanning documents into the system so they are included in our regular backups. Not only are scanned documents easier to organize, easier to access, more secure and subject to better audit trails, they also stand a significantly better chance of surviving when that frayed desk lamp cord goes unnoticed.
Thursday, September 3, 2009
Disaster Recovery Testing - Epic Fail #1
As I've mentioned before, my big project for this month is disaster recovery testing. A few things have changed since our last comprehensive test of our backup practices and we are long overdue. Because of this, I expect many "failures" along the way that will need to be remedied. I expect our network documentation to be lacking, I expect to be missing current versions of software in our disaster kit. I know for a fact that we don't have detailed recovery instructions for several new enterprise systems. This is why we test - to find and fix these shortcomings.
This week, at the beginning stages of the testing, we encountered our first "failure". We've dubbed it "Epic Failure #1" and it's all about those backup tapes.
A while back our outside auditor wanted us to password protect our tapes. We were running Symantec Backup Exec 10d at the time and were happy to comply. The password was promptly documented with our other important passwords. Our backup administrator successfully tested restores. Smiles all around.
We faithfully run backups daily. We run assorted restores every month to save lost Word documents, quickly migrate large file structures between servers and correct data corruption issues. We've had good luck with the integrity of our tapes. More smiles.
Earlier this week, I loaded up the first tape I needed to restore in my DR lab. I typed the password to catalog the tape and it told me I had it wrong. I typed it again, because it's not an easy password and perhaps I had made a mistake. The error message appeared; my smile did not.
After poking around in the Backup Exec databases on production and comparing existing XML catalog files from a tape known to work with the password, we concluded that our regular daily backup jobs simply have a different password. Or at least the password hash is completely different, and that difference is repeated across the password-protected backup jobs on all our production backup media servers. Frown.
After testing a series of tapes from different points in time and from different servers, we came to the following disturbing conclusion: the migration of our Backup Exec software from 10d to 12.5, which also required us to install version 11 as part of the upgrade path, mangled the password hashes on the pre-existing job settings. Or it uses a different algorithm, or something similar with the same result.
Any tapes with backup jobs that came from the 10d version of the software use the known password without issue. And any new jobs that are created without a password (since 12.5 doesn't support media passwords anymore) are also fine. Tapes that have the "mystery password" on them are only readable by a media server that has the tape cataloged already, in this case the server that created it. So while they are useless in a full disaster scenario, they work for any current restorations we need in production. We upgraded Backup Exec just a few months ago, so the overall damage is limited to a specific time frame.
Correcting this issue required our backup administrator to create new jobs without password protection. Backup Exec 12.5 doesn't support that type of media protection anymore (it was removed in version 11) so there is no obvious way to remove the password from the original job. Once we have some fresh, reliable backups off-site I can continue with the disaster testing. We'll also have to look into testing the new tape encryption features in the current version of Backup Exec and see if we can use those to meet our audit requirements.
The lesson learned here was that even though the backup tapes were tested after the software upgrade, they should have been tested on a completely different media server. While our "routine" restore tasks showed our tapes had good data, it didn't prove they would still save us in a severe disaster scenario.
Saturday, August 22, 2009
Disaster Recovery - But for Real
Meanwhile, I'm managing some real disaster recovery, but on a smaller scale. A few weeks ago I posted about the need to upgrade our ImageRight installation to resolve a bug that could cause some data loss. The ImageRight support staff worked hard to run the preliminary discovery/fixing of the image files and database, followed by performing the upgrade.
Not long after, I got an email from someone in another department asking me to "find" the annotations added to an invoice that seemed to have gone missing. She was assuming that since some temporary help had worked on the document, a user error had been made and a "copy without annotations" had been introduced. I figured I could recover the annotations by looking through the deleted pages and at previous versions of those pages.
However, what I found was a bit unexpected. I found a history of changes being made to the document, but no actual annotations visible. Curious.
So I opened a support ticket. After several remote sessions and some research, the ImageRight team was "nearly positive" (they need more testing to confirm) that the process run before our last upgrade to correct the potential data loss actually introduced a different kind of data loss. The result is that the database knows the affected annotations happened, but the physical files that represent the annotated versions had been replaced with non-annotated versions.
We do have the logs from the original process, so it was just a matter of ImageRight Support parsing that data to generate a list of files that were changed. Now we begin the task of recovering those files from tape.
Our Sr. DBA had been working on a side project that loads all our backup catalogs into a database so we have a comprehensive reference across all backup servers to identify which tapes to recall when people ask for recoveries. That project is proving its worth this time around, since we need to locate and restore over 1000 files. He also needs to cross-reference them against the individual documents accessible via the desktop client so we can do a visual comparison of any special cases and provide a record of which documents were affected in a format that's understandable to everyone else, in case additional concerns come up after we repair the damage.
Our current plan is to have this resolved by the end of next weekend, but this is something that needs careful handling since we don't want end users to have any doubt about the integrity of the system - a system I will still have total confidence in once we sort this out. Thus, I'm happy to spend the extra time making sure no other issues are introduced.
Plus I need some time to find our DBA some really nice coffee for his efforts.
Wednesday, August 19, 2009
Dusting off the Disaster Recovery Plan
Success or failure will be measured by what road bumps we encounter and most importantly, our ability to work around them using only the resources in the box. If I have to go "outside the box" for some critical piece of software or some undocumented configuration detail it would be a black mark in our preparations that needs to be remedied.
Our testing scenario includes the domain, Exchange, the document imaging system, the financial system, the primary file server and the time card application. We are also going to provide remote access to restored applications so staff from other departments can test out the results and give us feedback on changes that could improve the end-user experience during this type of event. As an added bonus, we'll be able to try out Server 2008 R2 Remote Desktop Services.
In the last 6 months we started using VMware ESX to consolidate some of our servers in production, but none of the machines needed for this scenario are virtual yet. I will be doing "classic" restores where the OS has to be installed before restoring our data from backup tapes. However, we are using VMware to host several of the machines in the disaster lab, so I will be able to save time by cloning my first installation of Windows Server a few extra times before installing specific applications.
Depending on how this project goes, I'd like to see us take more advantage of virtualization within our disaster recovery planning and maybe start looking into backup solutions that are easier and faster than tape.