Many disaster recovery plans (DRPs) focus attention on the “big” threats that have the largest impact such as natural disasters and man-made events. While these are all important and need to be both addressed and planned for, the probability of some of these events occurring could be extremely low.
Sonny Bennett looks at where organizations can prevent downtime by shifting the focus of their disaster recovery plans to events that are more likely to happen and are much easier to prevent.
Every organization, irrespective of size or revenue, should have plans in place to ensure they can continue operating in the event of a disaster or disruptive event. The two most important documents are the business impact analysis document (BIA) and the disaster recovery plan.
Business impact analysis (BIA):
The business impact analysis identifies the most important business functions, their criticality and their dependencies, for example other parts of the business, the systems required and the people or roles on which they rely. The BIA also includes a risk assessment detailing potential threats and vulnerabilities that could disrupt operations. Essentially, the BIA produces a hierarchy of systems or services and the order in which they need to be brought back online for the business to operate.
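As a rough illustration of how a dependency list can be turned into a recovery order, the sketch below uses a topological sort from Python's standard library. The service names and their dependencies are hypothetical and only show the idea; a real BIA would also weigh criticality and recovery time objectives.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services mapped to the services they depend on.
dependencies = {
    "network": set(),
    "storage": set(),
    "database": {"network", "storage"},
    "order_processing": {"database"},
    "email": {"network"},
}


def recovery_order(deps):
    """Return services ordered so that each one follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
```

Any order produced this way brings the network and storage back before the database, and the database back before order processing.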
Disaster recovery plan (DRP):
The disaster recovery plan defines the procedures and detailed instructions used to bring services back online as quickly as possible, so that the business is back up and running. Depending on the disruption, this may mean running at a degraded level of performance.
Many BIAs and DRPs have sections on how to recover from natural disasters or “acts of God” including avalanches, earthquakes, wildfires, floods, hurricanes, lightning strikes, solar flares, tsunamis, volcanic eruptions, etc. These threats vary depending on where the business is located. Man-made risks also need to be addressed, such as bioterrorism, civil unrest, fire, hazardous material spills, nuclear radiation, power failure and theft.
Once potential threats have been identified, they can be categorized and prioritized based on their impact and probability, as shown by the matrix. For these kinds of threats, the potential impact on business operations is high, yet the probability is low.
What some BIAs and DRPs fail to recognize
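One simple way to express such a matrix is to score each threat for impact and probability and then rank the results. The sketch below assumes a 1 to 5 scale and a threshold of 3 for "high"; the threats and scores are illustrative placeholders, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    impact: int       # 1 (minor) to 5 (severe), assumed scale
    probability: int  # 1 (rare) to 5 (frequent), assumed scale


def quadrant(threat, threshold=3):
    """Place a threat in one of the four cells of a 2x2 risk matrix."""
    impact = "high" if threat.impact >= threshold else "low"
    probability = "high" if threat.probability >= threshold else "low"
    return f"{impact} impact / {probability} probability"


# Illustrative scores only.
threats = [
    Threat("earthquake", impact=5, probability=1),
    Threat("disk failure", impact=3, probability=4),
    Threat("human error", impact=2, probability=5),
]

# Rank by a simple impact x probability score, highest first.
ranked = sorted(threats, key=lambda t: t.impact * t.probability, reverse=True)
for t in ranked:
    print(f"{t.name}: {quadrant(t)} (score {t.impact * t.probability})")
```

With these made-up scores, the earthquake lands in the high impact, low probability cell, while the everyday IT failures score higher overall.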
There’s no disagreement that “big” threats have the largest impact and need to be addressed and planned for.
However, ignoring “small” events that are more likely to happen can cause unnecessary downtime, loss of productivity and business cost. Concentrating efforts on minimizing these threats is often easier too; it is difficult and expensive to adequately protect against natural disasters and the majority of man-made events.
Where to start: IT failures
IT failure is one area that should be focused on. Risks include server failures, network outages, disk failures, data corruption, malicious attacks and human error. A simple way to address some of these issues is to build resilient solutions that eliminate as many single points of failure as possible and ensure that a single component failure does not lead to extended periods of downtime.
This would mean duplicating the number of servers required and providing a clustered solution. Additional resiliency could be provided by:
- Physically separating servers into different locations (continents, data centers or racks) to protect against natural and man-made disasters
- Using redundant power supplies, distribution boards and uninterruptible power supplies (UPS) to protect against power outages
- Providing multiple independent network connections, with separate network interface cards (NICs), switches and routers, and cabling that follows diverse routes, to eliminate network single points of failure
- Employing disk protection mechanisms such as synchronous data mirroring, RAID protection, erasure coding and hot spares, along with disk controllers that have battery backup, to minimize data loss
- Ensuring that a backup strategy is in place to protect against logical data corruption; this could be a traditional tape solution or snapshot based
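The effect of duplicating components can be estimated with the standard availability formulas for redundant (parallel) and chained (series) components. The sketch below assumes component failures are independent, which is optimistic in practice, and the 99% figure is a hypothetical example rather than a measured value.

```python
import math

HOURS_PER_YEAR = 8760


def parallel_availability(availability, copies):
    """Availability of redundant copies, where any one copy is enough."""
    return 1 - (1 - availability) ** copies


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def downtime_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR


single_server = 0.99  # hypothetical availability of one server
clustered_pair = parallel_availability(single_server, copies=2)

print(f"single server:  {downtime_hours_per_year(single_server):.1f} h/year")
print(f"clustered pair: {downtime_hours_per_year(clustered_pair):.1f} h/year")
```

Under these assumptions, a second server reduces expected downtime from roughly 87.6 hours a year to under one hour. The series formula shows the other side: every non-redundant component left in the chain pulls overall availability back down.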
Providing a redundant solution will solve many of these issues, but it also increases the cost of the solution. That cost needs to be weighed against the cost to the business of a service outage, which is measured not only in lost revenue but also in the impact on business credibility.
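A back-of-the-envelope version of this comparison might look like the following sketch. Every figure is a hypothetical placeholder, and the calculation covers direct outage cost only; harder-to-quantify effects such as loss of credibility would need to be added separately.

```python
def expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected annual cost of outages, counting direct losses only."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(cost_without, cost_with, annual_redundancy_cost):
    """True if the outage cost avoided exceeds what the redundancy costs."""
    return (cost_without - cost_with) > annual_redundancy_cost


# Hypothetical figures for illustration only.
without_redundancy = expected_outage_cost(2, 8, 5000)
with_redundancy = expected_outage_cost(2, 0.5, 5000)
worthwhile = redundancy_pays_off(without_redundancy, with_redundancy, 30000)

print(without_redundancy, with_redundancy, worthwhile)
```

In this made-up case, the avoided outage cost is larger than the annual cost of the redundant hardware, so the investment would be justified on revenue grounds alone.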
Ensuring data availability
Delivering resilient clustered solutions that ensure application uptime and data availability, and that keep businesses running in the event of a disaster, requires shared storage. StorMagic SvSAN provides cost-effective storage enabling service continuity.
SvSAN is a virtual SAN solution that uses the server’s internal disk capacity to deliver the required highly available shared storage, while eliminating single points of failure and reducing the cost and complexity associated with traditional storage arrays.