Monday, August 31, 2015
Business Continuity and the Cloud
Wednesday, August 19, 2015
Summer Reads!
- (Read) Containers: Docker, Windows and Trends
- (Do) 18 Steps for End-to-End IaaS Provisioning in the Cloud with Azure Resource Manager (ARM), PowerShell and Desired State Configuration (DSC)
- (Watch) Best practices for DR as a service
- (Read) What's New in Windows Server 2016 Technical Preview 3
- Raw Tech - Lots of Dev stuff, but a pretty interesting assortment.
- Azure Documentation Shorts - Quick videos covering the how-to of what's documented for Azure.
Monday, August 10, 2015
TechNet on Tour - Disaster Recovery!
- 9/1 - Seattle, WA
- 9/3 - San Francisco, CA
- 9/22 - Houston, TX
- 9/29 - Charlotte, NC
- 9/30 - Malvern, PA
- 10/6 - Indianapolis, IN
- 10/7 - Tampa, FL
- 10/8 - New York, NY
- 10/14 - Irvine, CA
- 10/16 - Dallas, TX
Tuesday, February 25, 2014
I’ve Got Nothing: The DR Checklist
Disclaimer: I love technology, I think that cloud computing and virtualization are paramount to increasing the speed you can get your data and services back online. But when disaster strikes, you can bet I’m reaching for something on paper to lead the way. You do not want your recovery plans to hinge on finding the power cable for that dusty laptop that is acting as the offline repository for your documentation. It’s old school, but it works. If you have a better suggestion than multiple copies of printed documentation, please let me know. Until then, finding a ring binder is my Item #0 on the list. (Okay, Hyper-V Recovery Manager is a pretty cool replacement for paper if you have two locations, but I'd probably still have something printed to check off...)
The Checklist
- Backups - I always start at the backups. When your data center is reduced to a pile of rubble the only thing you may have to start with is your backups, everything else supports turning those backups into usable services again. Document out your backup schedule, what servers and data are backed up to what tapes or sets, how often those backups are tested and rotated. Take note if you are backing up whole servers as VMs, or just the data, or both. (If you haven’t yet, read Brian’s post on the value of virtual machines when it comes to disaster recovery.)
- Facilities - Where are you and your backups going to come together to work this recovery magic? Your CEO’s garage? A secondary location that’s been predetermined? The Cloud? List out anything you know about facilities. If you have a hot site or cold site, include the address, phone numbers and access information. (Look at Keith’s blog about using Azure for a recovery location.)
- People - Your DR plan should include a list of people who are part of the recovery process. First and foremost, note who has the right to declare a disaster in the first place. You need to know who can and can’t kick off a process that will start with having an entire set of backups delivered to an alternate location. Also include the contact information for the people you need to successfully complete a recovery - key IT, facilities and department heads might be needed. Don’t forget to include their backup person.
- Support Services - Do you need to order equipment? Will you need support from a vendor? Include names and numbers of all these services and if possible, include alternatives outside of your immediate area. Your local vendor might not be available if the disaster is widespread like an earthquake or weather incident.
- Employee Notification System - How do you plan on sharing information with employees about the status of the company and what services will be available to use? Your company might already have something in place - maybe a phone hotline or externally hosted emergency website. Make sure you are aware of it and know how you can get updates made to the information.
- Diagrams, Configurations and Summaries - Include copies of any diagrams you have for networking and other interconnected systems. You'll be glad you have them for reference even if you don't build your recovery network the same way.
- Hardware - Do you have appropriate hardware to recover to? Do you have the networking gear, cables and power to connect everything together and keep it running? You should list out the specifications of the hardware you are using now and what the minimum acceptable replacements would be. Include contact information for where to order hardware from and details about how to pay for equipment. Depending on the type of disaster you are recovering from, your hardware vendor might not be keen on accepting a purchase order or billing you later. If you are looking at Azure as a recovery location, make sure to note what size of compute power would match up. (A quick PowerShell sketch for gathering these specs follows this list.)
- Step-By-Step Guides - If you’ve started testing your system restores, you should have some guides formed. If your plans include building servers from the ground up, your guides should include references to the software versions and licensing keys required. When you are running your practice restores, anything that makes you step away from the guide should be noted. In my last disaster recovery book, I broke out the binder in sections, in order of recovery with the step-by-steps and supporting information in each area. (Extra credit if you have PowerShell ready to automate parts of this.)
- Software - If a step in your process includes loading software, it needs to be available on physical media. You do not want to have to rely on having a working, high-speed Internet connection to download gigs of software.
- Clients - Finally, don’t forget your end users. Your plan should include details about how they will be connecting, what equipment they would be expected to use if the office is not available and how you will initially communicate with them. Part of your testing should include having a pilot group of users attempt to access your test DR setup so you can improve the instructions they will be provided. Chances are, you’ll be too busy to make individual house calls. (For more, check out Matt’s post on using VDI as a way to protect client data.)
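If you want a head start on gathering those hardware specs (and some fodder for the step-by-step guides), here's a minimal PowerShell sketch that pulls basic details into a CSV you can print for the binder. The server names are placeholders - point it at your own list or feed it names from Active Directory.

```powershell
# Minimal sketch: collect basic hardware specs for the DR binder.
# The server names below are placeholders - substitute your own.
$servers = 'SERVER01', 'SERVER02'

$inventory = foreach ($server in $servers) {
    $cs = Get-CimInstance -ClassName Win32_ComputerSystem -ComputerName $server
    $os = Get-CimInstance -ClassName Win32_OperatingSystem -ComputerName $server
    [PSCustomObject]@{
        Name     = $server
        Model    = $cs.Model
        CPUs     = $cs.NumberOfProcessors
        MemoryGB = [math]::Round($cs.TotalPhysicalMemory / 1GB, 1)
        OS       = $os.Caption
        LastBoot = $os.LastBootUpTime
    }
}

# Drop the results next to the rest of your DR documentation.
$inventory | Export-Csv -Path .\DR-HardwareInventory.csv -NoTypeInformation
```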
No matter the result of your testing, it will be better than the last time. Go forth and be prepared.
Oh, one more thing, if you live in a geographic area where weather or other "earthly" disasters are probable, please take some time to do some DR planning for your home as well. I don't care who you work for, if your home and family aren't secure after a disaster you certainly won't be effective at work. Visit www.ready.gov or www.redcross.org/prepare/disaster-safety-library for more information.
This post is part of a 15-part series on Disaster Recovery and Business Continuity planning by the US-based Microsoft IT Evangelists. For the full list of articles in this series, see the intro post located here: http://mythoughtsonit.com/2014/02/intro-to-series-disaster-recovery-planning-for-i-t-pros/
Tuesday, February 18, 2014
Question: Is there value in testing your Disaster Recovery Plan?
There are a few reasons you need to regularly test your recovery plans… I’ve got my top three.
- Backups only work if they are good.
- Your documentation is only useful if you can follow it.
- You are soft and easily crushed.
Everyone knows the mantra of “backup, backup, backup” but you also have to test those backups for accuracy and functionality. I’m not going to beat this one endlessly, but please read an old post of mine - “Epic Fail #1” to see how backups can fail in spectacular, unplanned ways.
Documentation
Simply put, you need good documentation. You need easy-to-locate lists of vendors, support numbers, configuration details of machines and applications, notes on how "this" interacts with "that", which services have dependencies on others and step-by-step instructions for processes you don't do often and even those you DO do every day.
When under pressure to troubleshoot an issue that is causing downtime, it's likely you'll lose track of where to find the information you need to successfully recover. Having clean documentation will keep you calm and focused at a time you really need to have your head in the game.
Realistically, your documentation will be out of date when you use it. You won’t mean for it to be, but even if you have a great DR plan in place, I’ll bet you upgraded a system, changed vendors, or altered a process almost immediately after your update cycle. Regular review of your documents is a valuable part of testing, even if you don’t touch your lab.
My personal method is to keep a binder with hard copies of all my DR documentation handy. Whenever I change a system, I make a note on the hard copy. Quarterly, I update the electronic version and reprint it. With the binder, I always have information available in case the electronic version is not accessible, and the copy with the handwritten margin notes is often more up to date than the electronic version. Even something declaring a section "THIS IS ENTIRELY WRONG NOW" can save someone hours of heading down the wrong path.
You
No one wants to contemplate their mortality, I completely understand. (Or maybe you just want to go on vacation without getting a call halfway through. Shocker, right?) But if you happen to hold the only knowledge of how something works in your data center, then you are a walking liability for your company. You aren't securing your job by being the only person with the password to the schema admin account, for example. It only takes one run-in with a cross-town bus to create a business continuity issue for your company that didn't even touch the data center.
This extends to your documentation. Those step-by-step instructions for recovery need to include information and tips that someone else on your team (or an outside consultant) can follow without having prior intimate knowledge of that system. Sometimes the first step is “Call Support, the number is 800-555-1212” and that’s okay.
The only way to find out what others don’t know is to test. Test with tabletop exercises, test with those backup tapes and test with that documentation. Pick a server or application and have someone who knows it best write the first draft and then hand it to someone else to try to follow. Fill in the blanks. Repeat. Repeat again.
A lot of this process requires only your time. Time you certainly won't have when your CEO is breathing down your neck about recovering his email.
Additional Resources
This post is part of a 15-part series on Disaster Recovery and Business Continuity planning by the US-based Microsoft IT Evangelists. For the full list of articles in this series, see the intro post located here: http://mythoughtsonit.com/2014/02/intro-to-series-disaster-recovery-planning-for-i-t-pros/
If you are ready to take things further, check out Automated Disaster Recovery Testing with Hyper-V Replica and PowerShell - http://blogs.technet.com/b/keithmayer/archive/2012/10/05/automate-disaster-recovery-plan-with-windows-server-2012-hyper-v-replica-and-powershell-3-0.aspx
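For a taste of what that automation looks like, here's a minimal sketch of a scripted test failover with Hyper-V Replica. The host and VM names are placeholders, and it assumes replication is already configured and the Hyper-V PowerShell module is available.

```powershell
# Minimal sketch of a test failover with Hyper-V Replica.
# 'DR-HOST' and 'FileServer01' are placeholder names for your replica host and VM.
$replicaHost = 'DR-HOST'
$vmName      = 'FileServer01'

# Check replication health before testing.
Get-VMReplication -ComputerName $replicaHost -VMName $vmName

# Start a test failover - this spins up an isolated copy of the VM
# without disturbing the production replication relationship.
Start-VMFailover -ComputerName $replicaHost -VMName $vmName -AsTest

# ...validate the test VM on an isolated network, then clean up:
Stop-VMFailover -ComputerName $replicaHost -VMName $vmName
```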
Tuesday, February 11, 2014
Disaster Recovery for IT Pros: How to Plan, What are the Considerations?
I've always thought that being an IT Pro is one of the most powerful, powerless jobs in existence. We have our fingers on the pulse of what makes our businesses run, we have access to ALL THE DATA and we have the power to control access and availability to the resources. But we are often slaves to the business - we are responsible for providing the best up times, the best solutions and the best support we can. Facing budgets we can't always control while trying to explain technology to people who don't have time to understand it.
So where do you begin when tasked with updating or creating your disaster recovery plan? The good news is you don't need money or lots of extra hardware to start good disaster recovery planning - grab the note-taking tools of your choice and start asking questions.
Here are my three main questions to get started:
- What are the most important applications or services in each business unit or for the business overall?
- How much downtime is acceptable (your recovery time objective)?
- How much data loss is acceptable (your recovery point objective)?
This post is one of many in a disaster recovery series being penned by the IT Pro Evangelists at Microsoft. As the series progresses, you'll find the complete list on Brian Lewis's blog post, "Blog Series: DR Planning for IT Pros." We will cover tools and applications you can consider in your planning and get you started with using them. They have various costs, but until you know your goal, you won't know what tools will help and can't argue the budget.
So let's put the pencil to the paper and start answering those three questions.
Start at the top: Go to upper management and have your CTO or CIO pull together a leadership meeting to rank what systems the business units use and what they think is needed first. Get them to look at the business overall and determine how much downtime is too much, how quickly they want services recovered and how much data they are willing to lose.
When it comes to determining your internal SLA you do need to know what scenario you are planning for. Preparing for a riot that blocks access to your office is different than an earthquake that renders your data center a steaming pile of rubble. Ultimately, you want different plans for different scenarios, but if you must start somewhere, go with the worst case so you can cover all your bases.
But what if you can't get leadership to sit down for this, or they want you to come to the table first with a draft plan? Just GUESS.
Seriously, you have your hand on the data center, you know the primary goals of your business. If it was your company, what do you think you need to recover first? Use your gut to get you started. Look at your data center and pick out some of the key services that likely need to be recovered first to support the business needs. Domain controllers, encryption key management systems, infrastructure services like DNS and DHCP, communication tools and connectivity to the Internet might float to the top.
Sort the List: People want email right away? Great, that also needs an Internet connection and access to your authentication system, like Active Directory. People want the document management system or CRM or some in-house app with a database back-end? Fabulous, you need your SQL Servers and maybe the web front-end or the server that supports the client application.
Gather Your Tools: Look at your list of loosely ranked servers, devices and appliances and start building a shopping list of things you need to even start recovery. I always start with the "steaming pile of rubble" scenario, so my list starts like this (a quick PowerShell starter for the inventory items follows the list):
- Contact information for hardware and software vendors
- Contact information and locations where my data center can function temporarily
- List of existing hardware and specifications that would need to be met or exceeded if ordering new equipment for recovery
- List of operating systems and other software, with version details and support documentation
- Names of the people in the company that would be crucial to the successful recovery of the data center
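As a starting point for the hardware and software lists, here's a minimal sketch that pulls a rough server inventory out of Active Directory. It assumes the ActiveDirectory RSAT module is installed and that you have rights to query the domain.

```powershell
# Minimal sketch: pull a rough server and OS inventory out of Active Directory
# to seed the lists above. Requires the ActiveDirectory RSAT module.
Import-Module ActiveDirectory

Get-ADComputer -Filter 'OperatingSystem -like "*Server*"' `
    -Properties OperatingSystem, OperatingSystemVersion, Description |
    Select-Object Name, OperatingSystem, OperatingSystemVersion, Description |
    Sort-Object Name |
    Export-Csv -Path .\ServerInventory.csv -NoTypeInformation
```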
Congratulations! You are closer to a usable DR plan than you were before you started and we've just scratched the surface. Disaster Recovery planning is often pushed off until tomorrow. Whatever you have today, be it an outdated document from your company leadership, server documentation that is a year old, or NOTHING, you can take time each day to improve it. How you plan is going to depend on the needs of your organization and you won't be able to complete the process in a silo, but you can get started.
I really enjoy disaster recovery planning. It's challenging, it's ever changing and I haven't even mentioned how things like virtualization, Hyper-V Replica and Azure can be some of the tools you use. Stay tuned for more in the series about how some of those things can come into play. Sometimes the hardest part about disaster recovery planning is just getting started.
***
Wednesday, May 30, 2012
End of the Month Round Up
I won't be speaking this year, but that just gives me more time to attend some of the great sessions - I'll be concentrating on Active Directory in Server 2012, Exchange 2010, PowerShell and some System Center.
If you are hoping for something more local to your home town, check out the Windows Server 2012 Community Roadshow. US locations will include Houston, Chicago, Irvine, New York and San Jose, just to name a few. Microsoft MVPs will be presenting the content, so don't miss out on a free chance to prepare for the release of Server 2012.
Another notable event that's upcoming is the World IPv6 Launch. Check out which major ISPs and web companies are turning on IPv6 for the duration.
Finally, if you are looking to make some improvements to your personal, cloud-based storage and file management for your personal computers, take a look at SugarSync. I've been using it for several years and it's been an easy way for me to access files from multiple computers and keep everything synced and backed up. I've even got a link for a referral if you'd like to try it out.
Thursday, January 26, 2012
Recovering Exchange 2010 - Notes from the Field
Check out this TechNet article with the basics for recovering Exchange 2010. However, there are some little tips that would be helpful, especially when you might be working under a stressful situation to restore your mail system.
- Make sure you know where your install directory is if Exchange isn't installed in the default location. If you don't have it written down as part of your disaster recovery documentation, you can get that information out of Active Directory using ADSIEDIT (a PowerShell alternative is sketched after this list).
- Make sure you know the additional syntax for "setup /m:RecoverServer" switch. If you need to change the target directory the proper syntax is /t:"D:\Microsoft\Exchange\V14" or whatever your custom path happens to be.
- If you are planning on using the /InstallWindowsComponents switch to save some time with getting your IIS settings just right, make sure you've preinstalled the .NET Framework 3.5.1 feature set first.
- Don't forget to preinstall the Office 2010 Filter Packs. You don't need them to complete the setup, but you will be reminded about them as a requirement.
- Make sure you install your remote agent (or whatever components are necessary) for your backup software. Once the Exchange installation is restored, you'll need to mark your databases as "This database can be overwritten by a restore" so that you can restore the user data.
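As promised above, here's a hedged PowerShell sketch covering two of those steps: reading the install path out of Active Directory (assuming the msExchInstallPath attribute on the server object, queried with the ActiveDirectory module) and allowing the databases to be overwritten once setup finishes (run in the Exchange Management Shell). The server name is a placeholder.

```powershell
# Minimal sketch: read the Exchange install path from Active Directory
# instead of browsing with ADSIEDIT. Requires the ActiveDirectory module.
Import-Module ActiveDirectory
$configNC = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase $configNC -LDAPFilter '(objectClass=msExchExchangeServer)' `
    -Properties msExchInstallPath |
    Select-Object Name, msExchInstallPath

# After the recovery install completes, allow the databases to be overwritten
# so the backup software can lay the user data back down.
# Run this part in the Exchange Management Shell; 'EXCH01' is a placeholder.
Get-MailboxDatabase -Server 'EXCH01' | Set-MailboxDatabase -AllowFileRestore $true
```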
Thursday, October 20, 2011
Playing IT Fast and Loose
It also helps that our business model doesn't require selling things to the public or answering to many external "customers", which puts us in the interesting position where it's almost okay if we are down for a day or two, as long as we can get things back to pretty close to where they were before they went down. That also sets us up to make some very interesting decisions come budget time. They aren't necessarily "wrong", but they can end up being awkward at times.
For example, we've been working over the last two years to virtualize our infrastructure. This makes lots of sense for us - our office space requirements are shrinking and our servers aren't heavily utilized individually, yet we tend to need lots of individual servers due to our line of business. When our virtualization project finally got rolling, we opted to use a small array of SAN devices from Lefthand (now HP). We've always used Compaq/HP equipment and have been very happy with the dependability of the physical hardware. Hard drives are considered consumables and we do expect failures of those from time to time, but whole systems really biting the dust? Not so much.
Because of all the factors I've mentioned, we made the decision NOT to mirror our SAN array, or do any network RAID. (That's right, you can pause for a moment while the IT gods strike me down.) We opted for using all the space we could for data and weighed that against the odds of a failure that would destroy the data on a SAN, rendering the entire RAID 0 array useless.
Early this week, we came really close. We had a motherboard fail on one of the SANs, taking down our entire VM infrastructure. This included everything except the VoIP phone system and two major applications that have not yet been virtualized. We were down for about 18 hours total, which included one business day.
Granted, we spent the majority of our downtime waiting for parts from HP and planning for the ultimate worst - restoring everything from backup. While we may think highly of HP hardware overall, we don't think very highly of their 4-hour response windows on Sunday nights. Ultimately, over 99% of the data on the SAN survived the hardware failure and the VMs popped back into action as soon as the SAN came back online. We only had to restore one non-production server from backup after the motherboard replacement.
Today, our upper management complimented us on how we handled the issue and was pleased with how quickly we got everything working again.
Do I recommend not having redundancy on your critical systems? Nope.
But if your company management fully understands and agrees to the risks related to certain budgeting decisions, then as an IT Pro your job is simply to do the best you can with what you have and clearly define the potential results of certain failure scenarios.
Still, I'm thinking it might be a good time to hit Vegas, because Lady Luck was certainly on our side.
Wednesday, March 30, 2011
Tomorrow is World Backup Day
If you are a systems admin, you probably already have a backup solution in place at the office or for your clients. Take some time tomorrow to check in on those processes to make sure you aren't missing something important and that they are working the way you expect.
At home, check on or implement a solution for your important files and photos on your home computers. It can be as simple as purchasing a portable drive or using a cloud based solution. I'm a SugarSync fan myself. If you want to check out SugarSync for yourself, use this referral code and get some bonus free space.
With the proper backup solution in place, your home laptop can be almost instantly replaceable with no worries. I recently reinstalled the OS on my netbook and was able to sync all my data files right back on with SugarSync. It's easy and helps me sleep better at night!
Learn more about World Backup Day at http://www.worldbackupday.net/
Monday, November 29, 2010
The How and Why of an ImageRight Test Environment
ImageRight has an interesting back-end architecture. While it's highly dependent on Active Directory for authentication (if you use the integrated log on method), the information about which other servers the application server and the client software should interact with is completely controlled by database entries and XML setup files. Because of this you can have different ImageRight application servers, databases and image stores all on the same network with no conflicts or sharing of information. Yet you don't need to provide a separate Active Directory infrastructure or network subnet.
While our ultimate goal was to provide a test/dev platform for our workflow designer, we also used this exercise as an opportunity to run a "mini" disaster recovery test so I could update our recovery documentation related to this system.
To set up a test environment, you'll need at least one server to hold all your ImageRight bits and pieces - the application server service, the database and the images themselves. For testing, we don't have enough storage available to restore our complete set of images, so we only copied a subset. Our database was a complete restoration, so test users will see a message about the system being unable to locate documents that weren't copied over.
I recommend referring to both the "ImageRight Version 5 Installation Guide" and the "Create a Test Environment" documents available on the Vertafore website for ImageRight clients. The installation guide will give you all the prerequisites needed to run ImageRight and the document on test environments has details of which XML files need to be edited to ensure that your test server is properly isolated from your production environment. Once you've restored your database, image stores and install share (aka "Imagewrt$"), it's quick and easy to tweak the XML files and get ImageRight up and running.
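Purely as an illustration of the pattern - the real file names and element names come from Vertafore's "Create a Test Environment" document, so everything below is a hypothetical placeholder - tweaking an XML config file with PowerShell looks something like this:

```powershell
# Hypothetical example only - substitute the file and elements the Vertafore
# guide actually calls out for your version.
$configPath = 'D:\ImageRight\Imagewrt$\ApplicationServer.config.xml'   # placeholder path

[xml]$config = Get-Content -Path $configPath

# Placeholder node names: point the restored config at the isolated test server.
$config.Configuration.ApplicationServer.HostName = 'IR-TEST01'

$config.Save($configPath)
```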
For our disaster recovery preparations, I updated our overall information about ImageRight, our step-by-step guide for recovery and burned a copy of our install share to a DVD so it can be included in our off-site DR kit. While you can download a copy of the official ImageRight ISO, I prefer to keep a copy of our expanded "Imagewrt$" share instead - especially since we've added hotfixes to the version we are running, which could differ from the current ISO available online from Vertafore.
Because setting up the test environment was so easy, I could also see a use where some companies may want to use alternate ImageRight environments for extra sensitive documents, like payroll or HR. I can't speak to the additional licensing costs of having a second ImageRight setup specifically for production, but it's certainly technically possible if using different permissions on drawers and documents doesn't meet the business requirements for some departments.
Monday, December 7, 2009
If You Build It, Can They Come?
One end-user access issue involved the Terminal Services ActiveX control on Windows XP SP3, which is disabled by default as part of a security update in SP3. This can usually be fixed with a registry change, which I posted about before; however, that requires local administrative privileges that not all our testing users had. There are also ActiveX version issues if the client machine is running an XP service pack earlier than SP3.
Administrative privileges also caused some hiccups with one of our published web apps that required a Java plug-in. At one point, the web page required a Java update that could only be installed by a server administrator and this caused logon errors for all the users until that was addressed.
In this lab setting, we had also restored our file server to a different OS. Our production file server is Windows 2000 and in the lab we used Windows 2008. This resulted in some access permission issues for some shared and "home" directories. We didn't spend any time troubleshooting the problem this time around, but when we do look to upgrade that server or repeat this disaster recovery test we know to look into the permissions more closely.
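When we do revisit those permissions, a minimal sketch like this can dump the ACLs on the restored shares into a CSV so they can be compared against production. The share root path is a placeholder.

```powershell
# Minimal sketch: export the owner and access entries for each restored share
# folder so they can be diffed against the production server.
$shareRoot = 'E:\Shares'   # placeholder path

Get-ChildItem -Path $shareRoot | Where-Object { $_.PSIsContainer } |
    ForEach-Object {
        $acl = Get-Acl -Path $_.FullName
        New-Object PSObject -Property @{
            Folder = $_.FullName
            Owner  = $acl.Owner
            Access = ($acl.Access | ForEach-Object {
                          "$($_.IdentityReference):$($_.FileSystemRights)"
                      }) -join '; '
        }
    } | Export-Csv -Path .\RestoredShareAcls.csv -NoTypeInformation
```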
Users also experienced trouble getting Outlook 2007 to run properly. I did not have issues when I tested my own account - there were some dialog boxes that needed to be addressed before it ran for the first time, to confirm the username and such. While the answers to those boxes seem second nature to those of us in IT, we realized we will need to provide better documentation to ensure that users get email working right the first time.
In the end, detailed documentation proved to be the most important aspect of rolling this test environment out to end users. In the event of a disaster, it's likely that our primary way of sharing initial access information would be by posting instructions to the Internet. Providing easy-to-follow instructions with step-by-step screenshots that users can work through independently is critical. After a disaster, I don't expect my department will have a lot of time for individual hand-holding for each user that will be using remote access.
Not only did this project provide an opportunity to update our procedures used to restore services, it showed that it's equally important to make sure that end users have instructions so they can independently access those services once they are available.
Wednesday, October 14, 2009
Document Imaging Helps Organize IT
This reduces version control issues and ensures that a common naming (or "filing") structure is used across the board, making information easier to find. (For reference, an ImageRight "file" is a collection of documents organized together like a physical file that hangs in a file cabinet.) Plus, the ability to export individual documents or whole ImageRight "files" to a CD with an included viewer application is a great feature that I'm using as part of our Disaster Recovery preparations.
I have a single file that encompasses the contents of our network "runbook". This file contains server lists and configuration details, IP and DNS information, network maps, application and service dependencies, storage share locations/sizes, support contact information, etc. It consists of text documents, spreadsheets, PDF files and other types of data. I keep a hard copy printed at my desk so I can jot notes when changes are needed, but ImageRight ensures I have an electronic backup that I can edit on a regular basis. Plus, I regularly export an updated copy to a CD that I add to the off-site Disaster Recovery box.
The value of ImageRight in a disaster scenario expands beyond just our configuration documents. In an office where we deal with large amounts of paper, encouraging people to make sure those documents are added to ImageRight in a timely manner will ensure faster access to work products after an event that prevents access to the office or destroys the paper originals.
Monday, September 21, 2009
Restoring ImageRight in the DR Scenario
The database is SQL 2005 and at this point it wasn't the first SQL restoration in this project, so that went relatively smoothly. We had some trouble restoring the "model" and "msdb" system databases, but our DBA decided those weren't critical to ImageRight and to let the versions from the clean installation stay.
Once the database was restored, I turned to the application server. A directory known as the "Imagewrt$" share is required as it holds all the installation and configuration files. We don't have all the same servers available in the lab, so we had to adjust the main configuration file to reflect the new location of this important share. After that, the application installation had several small hurdles that required a little experimentation and research to overcome.
First, the SQL Browser service is required to generate the connection string from the application server to the database. This service isn't automatically started in the standard SQL installation. Second, the ImageRight Application Service won't start until it can authenticate its DLL certificates against the http://crl.verisign.net URL. Our lab setup doesn't have an Internet connection at the moment so this required another small workaround - temporarily changing the IE settings for the service account to not require checking the publisher's certificate.
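For the first hurdle, a couple of lines of PowerShell (assuming the default SQLBrowser service name) make sure the browser service comes up with the lab server; the CRL workaround above remains a manual tweak to the service account's IE settings.

```powershell
# Minimal sketch: the standard SQL installation leaves the SQL Browser service
# stopped, so set it to start automatically and bring it up now.
Set-Service -Name 'SQLBrowser' -StartupType Automatic
Start-Service -Name 'SQLBrowser'
Get-Service -Name 'SQLBrowser'   # confirm it's running
```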
Once the application service was running, I installed the desktop client software on the machine that will provide remote desktop access to the application. That installed without any issue and the basic functions of searching for and opening image files were tested successfully. We don't have the disk space available in the lab to restore ALL the images and data, so any images older than when we upgraded to version 4.0 aren't available for viewing. We'll have to take note of the growth on a regular basis so that in the event of a real disaster we have a realistic idea of how much disk space is required. This isn't the first time I've run short during this test, so I'm learning my current estimates aren't accurate enough.
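To keep those estimates honest, a quick sketch like this (the image store path is a placeholder) can be run on a schedule to record how big the store has grown:

```powershell
# Minimal sketch: total up the size of the image store so growth can be
# tracked over time. The UNC path is a placeholder.
$imageStore = '\\IRSERVER\Images'

$bytes = (Get-ChildItem -Path $imageStore -Recurse -Force -ErrorAction SilentlyContinue |
          Where-Object { -not $_.PSIsContainer } |
          Measure-Object -Property Length -Sum).Sum

"{0} is using {1:N1} GB as of {2}" -f $imageStore, ($bytes / 1GB), (Get-Date).ToShortDateString()
```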
Of course, it hasn't been fully tested and there are some components I know we are using in production that might or might not be restored initially after a disaster. I'm sure I'll get a better idea of what else might be needed after we have some staff from other departments connect and do more realistic testing. Overall, I'm pretty impressed with how easy it was to get the basic functionality restored without having to call ImageRight tech support.
Thursday, September 17, 2009
Paper vs. Electronic - The Data Double Standard
The former is the information on our Exchange server, SQL servers, financial systems, file shares and the like. The latter is the boxes and drawers of printed pages - some of which originally started out on one of those servers (or a server that existed in the past) and some of which did not. In the event of a serious disaster it would be impossible to recreate those paper files. Even if the majority of the documents could be located and reprinted, any single group of employees would be unable to remember everything that existed in a single file, never mind hundreds of boxes or file cabinets. In the case of our office, many of those boxes contain data that dates back decades, including handwritten forms and letters.
Like any good company, we have a high level plan that dictates what information systems are critical and the amount of data loss that will be tolerated in the event of an incident. This document makes it clear that our senior management understands the importance of what the servers in the data center contain. Ultimately, this drives our IT department's regular data backup policies and procedures.
However, IT is the only department required by this plan to ensure the recovery of the data we are custodians of. What extent of data loss is acceptable for the paper data owned by every other department after a fire or earthquake? A year of documents lost? 5 years? 10 years? No one has been held accountable for answering that question, yet most of those same departments won't accept more than a day's loss of email.
Granted, a lot of our paper documents are stored off site and only returned to the office when needed, but there are plenty of exceptions. Some staffers don't trust off site storage and keep their "most important" papers close by. Others in the office will tell you that the five boxes next to their cube aren't important enough to scan, yet are referenced so often they can't possibly be returned to storage.
And therein lies the battle we wage daily as custodians of the imaging system: simply getting everyone to understand the value of scanning documents into the system so they are included in our regular backups. Not only are scanned documents easier to organize, easier to access, more secure and subject to better audit trails, they also stand a significantly better chance of surviving when that frayed desk lamp cord goes unnoticed.
Thursday, September 3, 2009
Disaster Recovery Testing - Epic Fail #1
As I've mentioned before, my big project for this month is disaster recovery testing. A few things have changed since our last comprehensive test of our backup practices and we are long overdue. Because of this, I expect many "failures" along the way that will need to be remedied. I expect our network documentation to be lacking, I expect to be missing current versions of software in our disaster kit. I know for a fact that we don't have detailed recovery instructions for several new enterprise systems. This is why we test - to find and fix these shortcomings.
This week, at the beginning stages of the testing, we encountered our first "failure". We've dubbed it "Epic Failure #1" and it's all about those backup tapes.
A while back our outside auditor wanted us to password protect our tapes. We were running Symantec Backup Exec 10d at the time and were happy to comply. The password was promptly documented with our other important passwords. Our backup administrator successfully tested restores. Smiles all around.
We faithfully run backups daily. We run assorted restores every month to save lost Word documents, quickly migrate large file structures between servers and correct data corruption issues. We've had good luck with the integrity of our tapes. More smiles.
Earlier this week, I loaded up the first tape I needed to restore in my DR lab. I typed the password to catalog the tape and it told me I had it wrong. I typed it again, because it's not an easy password and perhaps I had made a mistake. The error message appeared; my smile did not.
After poking around in the Backup Exec databases on production and comparing existing XML catalog files from a tape known to work with the password, we concluded that our regular daily backup jobs simply have a different password. Or at least the password hash is completely different, and that difference is repeated across the password-protected backup jobs on all our production backup media servers. Frown.
After testing a series of tapes from different points in time and from different servers, we came to the following disturbing conclusion: the migration of our Backup Exec software from 10d to 12.5, which also required us to install version 11 as part of the upgrade path, mangled the password hashes on the pre-existing job settings. Or it uses a different algorithm, or something similar with the same result.
Any tapes with backup jobs that came from the 10d version of the software use the known password without issue. And any new jobs that are created without a password (since 12.5 doesn't support media passwords anymore) are also fine. Tapes that have the "mystery password" on them are only readable by a media server that has the tape cataloged already, in this case the server that created it. So while they are useless in a full disaster scenario, they work for any current restorations we need in production. We upgraded Backup Exec just a few months ago, so the overall damage is limited to a specific time frame.
Correcting this issue required our backup administrator to create new jobs without password protection. Backup Exec 12.5 doesn't support that type of media protection anymore (it was removed in version 11) so there is no obvious way to remove the password from the original job. Once we have some fresh, reliable backups off-site I can continue with the disaster testing. We'll also have to look into testing the new tape encryption features in the current version of Backup Exec and see if we can use those to meet our audit requirements.
The lesson learned here was that even though the backup tapes were tested after the software upgrade, they should have been tested on a completely different media server. While our "routine" restore tasks showed our tapes had good data, it didn't prove they would still save us in a severe disaster scenario.
Saturday, August 22, 2009
Disaster Recovery - But for Real
Meanwhile, I'm managing some real disaster recovery, but on a smaller scale. A few weeks ago I posted about the need to upgrade our ImageRight installation to resolve a bug that could cause some data loss. The ImageRight support staff worked hard to run the preliminary discovery/fixing of the image files and database, followed by performing the upgrade.
Not long after, I got an email from someone in another department asking me to "find" the annotations added to an invoice that seemed to have gone missing. She was assuming that since some temporary help had worked on the document, a user error had been made and a "copy without annotations" had been introduced. I figured I could recover the annotations by looking through the deleted pages and at previous versions of those pages.
However, what I found was a bit unexpected. I found a history of changes being made to the document, but no actual annotations visible. Curious.
So I opened a support ticket. After several remote sessions and some research, the ImageRight team was "nearly positive" (they need more testing to confirm) that the process run before our last upgrade to correct the potential data loss actually introduced a different kind of data loss. The result is that the database knows the affected annotations happened, but the physical files that represent the annotated versions had been replaced with non-annotated versions.
We do have the logs from the original process, so it was just a matter of ImageRight Support parsing that data to generate a list of files that were changed. Now we begin the task of recovering those files from tape.
Our Sr. DBA had been working on a side project that loads all our backup catalogs into a database so we have a comprehensive reference across all backup servers to identify which tapes to recall when people ask for recoveries. That project is proving its worth this time around, since we need to locate and restore over 1000 files. He also needs to cross-reference them against the individual documents accessible via the desktop client so we can do a visual comparison of any special cases and provide a record of which documents were affected in a format that's understandable to everyone else, in case additional concerns come up after we repair the damage.
Our current plan is to have this resolved by the end of next weekend, but this is something that needs careful handling since we don't want end users to have any doubt about the integrity of the system - a system I will still have total confidence in once we sort this out. Thus, I'm happy to spend the extra time making sure no other issues are introduced.
Plus I need some time to find our DBA some really nice coffee for his efforts.
Wednesday, August 19, 2009
Dusting off the Disaster Recovery Plan
Success or failure will be measured by what road bumps we encounter and most importantly, our ability to work around them using only the resources in the box. If I have to go "outside the box" for some critical piece of software or some undocumented configuration detail it would be a black mark in our preparations that needs to be remedied.
Our testing scenario includes the domain, Exchange, the document imaging system, the financial system, the primary file server and the time card application. We are also going to provide remote access to restored applications so staff from other departments can test out the results and give us feedback on changes that could improve the end-user experience during this type of event. As an added bonus, we'll be able to try out Server 2008 R2 Remote Desktop Services.
In the last 6 months we started using VMware ESX to consolidate some of our servers in production, but none of the machines needed for this scenario are virtual yet. I will be doing "classic" restores where the OS has to be installed before restoring our data from backup tapes. However, we are using VMware to host several of the machines in the disaster lab, so I will be able to save time by cloning my first installation of Windows Server a few extra times before installing specific applications.
Depending on how this project goes, I'd like to see us take more advantage of virtualization within our disaster recovery planning and maybe start looking into backup solutions that are easier and faster than tape.