Drilling for Disaster Recovery

On an Air Force base in San Antonio, Texas, two men walk into the base exchange. They show their IDs to the clerks, don Halloween masks, and proceed to play the part of terrorists. “This is a drill! We want liberation!  This is only a drill!” Airmen shopping in the facility give them funny looks, but wait patiently to see what happens. After the men are satisfied that they have sufficiently tested the reactions of the store clerks, they move toward the door, proclaiming their allegiance to some pretend organization and repeating, “This is a drill! This is only a drill!”  Why all the fuss? Wouldn’t it be enough to instruct the store clerks in what to do if two terrorists walked in the door? No, the Air Force wanted to make sure that they were ready.

What about you and your organization? Are you ready if the worst should happen? Are you doing disaster recovery (DR) testing?

Be Prepared!

You all know the Boy Scout motto. It’s short and sweet: “Be prepared!” We may hope for a life of ease and success, along with uneventful days in the data center, but it would be foolish to ignore the fact that disaster could strike at any moment. Some troubles may seem insignificant at the time, but the impact on customers could do great damage to your company’s reputation — and the bottom line.

Perhaps the first thing to do in anticipating potential disasters would be to brainstorm and list them. What’s the worst that could happen? Dale Carnegie, in his book How to Stop Worrying and Start Living, shares how he deals with potential disasters:

“First, I ask myself what is the worst that can possibly happen. Second, I try to accept it mentally. Third, I concentrate on the problem and see how I can improve the worst which I am already willing to accept – if I have to.”

Data center managers do not have the luxury of putting their heads in the sand or wallowing in worry when it comes to warding off the worst. It’s best to accept that things will go wrong — we just don’t know exactly what or when — and then do something about it. That is why every data center should have a clearly written disaster recovery plan. It should account for possible events and include any number of possible scenarios. The company Silverback Data Center Solutions suggests ten:

  1. Cascading systems failure
  2. Data corruption / loss
  3. Hacking / malicious code
  4. Earthquake
  5. Human error
  6. False redundancy
  7. Power failure
  8. Fiber cut / loss of network
  9. Fire
  10. Flood

 

Just because you have a plan doesn’t mean that you have covered all the bases. What if your disaster recovery plan fails — in the middle of the disaster? DR testing is essential to the success of real-life disaster recovery. “An end-to-end disaster recovery exercise is the only way to effectively build the confidence among stakeholders on the recoverability of the disaster recovery environment,” writes Shankar Subramaniyan for Disaster Recovery Journal.

Types of DR Testing

DR testing takes several forms. TechTarget writers Paul Kirvan and Sonia Lelii describe many possible tests. A disaster recovery plan, which is a subset of the larger Business Continuity Plan (BCP), can be evaluated in three primary ways:

  1. A plan review
  2. A tabletop test
  3. A simulation

 

During a BCP/DRP review, the owner of the plan sits down with the team and discusses the plan in detail. The owner could be a technical or business manager within the company, or it could even be a consultant or outsourced IT firm contracted for that purpose.

A tabletop test, explained very clearly in a video from the disaster recovery experts at Databarracks, is a walk-through of recovery plans without actually performing any of the actions. Presenter James Watts points to a tabletop test sample in the form of a tutorial that you can run yourself, called “Fire in the Server Room”. Using something called a runbook as a source of information, the tutorial challenges the learner with tasks like, “Find the designated assembly point to evacuate all staff.” Watts compares the tabletop test to the multi-user game Dungeons and Dragons. A tabletop test should include the following people:

  • Facilitator (third party?)
  • Leader
  • Department heads
  • Suppliers and other third parties
  • Note-taker

 

A simulation would be a run-through of any of the scenarios that you anticipate might happen (see the top ten list above). This could mean the involvement of many more people in acting out a potential disaster. Caution should be taken not to bring down any actual customers on the live network, unless it is done during a scheduled maintenance window.

Time and Scope of DR Testing

Whatever schedule you and your business select for your IT disaster recovery testing, it should be regular and consistent. Experts recommend some kind of partial testing every quarter and full-scale testing every year. You may even want to set something up for every month, or possibly offer techs some DR exercises that they can do during idle time while monitoring the network.
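As a rough illustration of that cadence, the Python sketch below builds a one-year calendar with one test per quarter, where one of those tests is full-scale and the rest are partial. The function name, the choice of the first day of each quarter, and the default of making the fourth-quarter test the full-scale one are assumptions made for the example, not recommendations from the experts mentioned here.

```python
from datetime import date
from typing import List, Tuple


def dr_test_schedule(year: int, full_scale_quarter: int = 4) -> List[Tuple[date, str]]:
    """Build a hypothetical DR test calendar for one year.

    One test per quarter; the test in `full_scale_quarter` is full-scale,
    the others are partial. Dates fall on the first day of each quarter.
    """
    if full_scale_quarter not in (1, 2, 3, 4):
        raise ValueError("full_scale_quarter must be 1, 2, 3, or 4")
    schedule = []
    for quarter in (1, 2, 3, 4):
        first_month = 3 * (quarter - 1) + 1  # 1, 4, 7, 10
        kind = "full-scale" if quarter == full_scale_quarter else "partial"
        schedule.append((date(year, first_month, 1), kind))
    return schedule


if __name__ == "__main__":
    for day, kind in dr_test_schedule(2025):
        print(day.isoformat(), kind)
```

A calendar like this could be fed into whatever ticketing or change control system you already use, so that each drill gets its own approval and rollback plan.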

There should be a clear delineation between the DR test environment and the production environment. To protect customers, you may want to integrate your DR testing with your change control procedures. Change control includes such things as methods of procedure (MOPs), approvals, and rollbacks. You can never be too careful.

The next obvious question arises: What should be tested? It wouldn’t do to say “everything” unless, of course, you’re prepared to define what “everything” is. You would do that by building a complete inventory, or by using one that already exists. Everything that you want to test should be in that inventory. In a YouTube video, technology professional James Myers identifies four areas that should be in your disaster recovery plan:

  1. Network
  2. Database
  3. Desktop
  4. Server

 

Of course, we no longer think of those components as equipment only. In this new world of virtualization and cloud computing, so much of this has been moved to software.

Virtualization and DR Testing

There are claims across the internet about the benefits of virtualization when it comes to disaster recovery testing. In a quick tips video, Brad Wagner of Vizioncore tells us that replicating to virtual machines can make DR testing much easier, less expensive, and less complex. He says to leverage the benefits of virtualization and the use of containers to create a separate DR testing environment.

But Chris Evans, writing for Computer Weekly, shares what he calls “the truth about virtualization and disaster recovery”. He warns of “uncertainty” and “false claims from suppliers”. As a reminder, Evans defines the term for us: “Virtualisation abstracts the physical resources of the server into logical constructs that represent hard drives, network cards and disk controllers.”  Despite his warning, Evans’ concerns seem to fade away as he goes on to praise the merits of virtualization:

  • Simple backup/restore
  • VM migration
  • High availability/fault tolerance
  • Continual backup
  • Application resiliency

 

The ease and effectiveness of point-and-click virtualization should be considered when designing and implementing any DR testing. Why would anyone want to buy all that expensive hardware these days? If an IT infrastructure component is virtual, creating a test environment using a virtual clone makes sense. But there is a valid argument for testing hardware-based resources using actual hardware.

Configuration Drift and DR Monitoring Tools

One of the biggest problems with a DRP is that it can get stale pretty quickly. You finalize your plan today, and tomorrow somebody makes significant changes to the network. And what happens when someone changes the IP address of a server in your plan? Experts say you should avoid using fixed IP addresses and use hostnames instead, or devise some other way to maintain control.
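One simple way to catch this kind of drift is to compare what the plan records against what the network currently reports. The following Python sketch does that for two hostname-to-address mappings. The function name, the hostnames, and the addresses are all hypothetical, and in practice the observed values would come from DNS or an inventory system rather than a hard-coded dictionary.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    planned: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return hosts whose planned and observed addresses differ.

    Each value is (planned_address, observed_address); None means the host
    is missing from that side.
    """
    drift = {}
    for host in sorted(set(planned) | set(observed)):
        before, after = planned.get(host), observed.get(host)
        if before != after:
            drift[host] = (before, after)
    return drift


if __name__ == "__main__":
    # Hypothetical data: addresses written into the DRP vs. addresses seen today.
    planned = {"db01": "10.0.0.5", "web01": "10.0.0.10"}
    observed = {"db01": "10.0.0.7", "web01": "10.0.0.10", "cache01": "10.0.0.20"}
    for host, (before, after) in find_drift(planned, observed).items():
        print(f"{host}: plan says {before}, network says {after}")
```

Run on a schedule, a check like this flags both changed addresses and servers that were added after the plan was written.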

But it’s very hard to keep control of an ever-changing, growing network. IT infrastructures are dynamic things, almost as if they were alive. (We talk about heartbeats and live networks, so maybe they are in a way.) To deal with the constant changes, vendors have developed DR monitoring tools. You can read more about them in an article called “Disaster recovery monitoring tools boost data protection and simplify DR”. The writer talks about three types of tools:

  1. Software used to store info about the DRP
  2. Tools to help set up scenarios
  3. Passive tools that aggregate the data protection processes that are going on

 

Maybe in a few years all these disaster recovery testing functions will be automated. Just punch a button, and then wait for the results. For now, we need to remain vigilant.

Measuring DR Testing

Two terms that we haven’t mentioned yet are very important in the disaster recovery vocabulary. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are metrics that should be assigned prior to any DR test. RTO is the amount of time a business can tolerate the outage of an IT service. You might say that the service has to be back online within this time frame. RPO is the point in time before the outage to which an application’s data must be recoverable. Put another way, it is the maximum amount of recent data, measured in time, that the business can afford to lose.

There will be different metrics for different services, depending on how critical they are to the business. Services for banking or financial institutions, for instance, should have an RTO at or near zero (RTO = 0 minutes). Backend reporting might have more flexibility (e.g., RTO = 240 minutes). A critical database might need a backup from no more than five minutes before the disaster (RPO = 5 minutes), while a less important database might tolerate a larger number (e.g., RPO = 360 minutes).

Both RTO and RPO should be defined in the disaster recovery plan. Following any testing, actual results should be compared to these benchmarks. In fact, it would make sense to conduct DR testing in much the same way that software testing is done, using test cases for each critical function, comparing actual results with expected results, and writing reports accordingly.
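A minimal sketch of that comparison, in Python, might look like the following. The class and field names are invented for illustration, and all values are in minutes. The targets and the measured results are hypothetical; they are not taken from any real drill.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    service: str
    rto_minutes: int  # maximum tolerable outage
    rpo_minutes: int  # maximum tolerable data loss, expressed as time


@dataclass
class DrillResult:
    service: str
    recovery_minutes: int   # how long the service was actually down
    data_loss_minutes: int  # age of the newest data that was recovered


def evaluate(objective: Objective, result: DrillResult) -> dict:
    """Compare one drill result with its objective, test-case style."""
    if objective.service != result.service:
        raise ValueError("objective and result refer to different services")
    return {
        "service": objective.service,
        "rto_met": result.recovery_minutes <= objective.rto_minutes,
        "rpo_met": result.data_loss_minutes <= objective.rpo_minutes,
    }


if __name__ == "__main__":
    # Hypothetical targets and measurements, all in minutes.
    objectives = [Objective("critical-db", 15, 5), Objective("reporting", 240, 360)]
    results = [DrillResult("critical-db", 12, 9), DrillResult("reporting", 180, 60)]
    for objective, result in zip(objectives, results):
        print(evaluate(objective, result))
```

In this made-up run the critical database comes back within its RTO but misses its RPO, which is exactly the kind of finding a test report should record.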

Conclusion

We cannot emphasize strongly enough the need to run these disaster recovery drills regularly. You will not only improve the response time involved in recovery actions, but you will also find areas of weakness that should be addressed. You will learn that every DR testing failure is actually a success, simply because it gives you the opportunity to correct and improve your DRP before an actual disaster occurs. Running drills is a good thing. Hospitals run disaster preparedness drills. Schools have fire drills. Military bases drill to prepare for terrorist infiltration. Regularly scheduled DR testing will help make your enterprise better prepared for the worst-case scenario. As they say, practice makes perfect.
