Avoiding downtime and maximizing service availability, or uptime, is a critical requirement for many organizations that operate around the clock. Fundamentally, this requirement applies to the IT infrastructure that supports the business, no matter where it is located: in centralized data centers, at remote sites or, more recently, in the “cloud”.
In this article we look at what is meant by availability, how it is measured, the impact of an outage and how to mitigate downtime, especially as more workloads move to cloud-based infrastructures.
How is availability defined?
In an ideal world, organizations would deliver 100% service availability or, to put it another way, they would have no downtime. For this to happen, there would have to be no equipment failures, no power or network outages, no software bugs requiring patches, and so on. In reality, this is not possible: IT components (servers, disk drives, networks) fail, power outages or “brownouts” occur, network connectivity drops, and software upgrades are needed to fix bugs, close security flaws or add functionality. All of these reduce service availability.
To allow for these events, organizations define service-level agreements (SLAs) for each application or service, stating the acceptable periods of both planned and unplanned “downtime”. The availability target in an SLA is typically expressed as a percentage: the probability that a system is operational in a given time period.
Typically, availability is referred to by the number of nines (9s) a system or application has. For example, “3-nines” is “99.9%”, or 8.76 hours of downtime per year. Increasing the number of “nines” equates to higher service availability, with each additional nine cutting the permitted downtime by a factor of ten, as shown in the table below.
Popular cloud services offer monthly uptime percentages of at least 99.95% for virtual machines, which allows for roughly 22 minutes of downtime in a 30-day month.
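The downtime figures quoted above follow directly from the availability percentage. The short Python sketch below reproduces them; it assumes a 365-day year and a 30-day month, and the function and constant names are illustrative rather than taken from any SLA tooling.

```python
HOURS_PER_YEAR = 365 * 24                 # 8,760 hours, ignoring leap years
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200 minutes


def allowed_downtime(availability_pct: float, period: float) -> float:
    """Return the downtime permitted in a period, in the same unit as the period."""
    return period * (1 - availability_pct / 100)


# Annual downtime budget for two to five nines.
for nines, pct in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999)]:
    hours = allowed_downtime(pct, HOURS_PER_YEAR)
    print(f"{nines}-nines ({pct}%): {hours:.2f} hours, or {hours * 60:.1f} minutes, per year")

# Monthly downtime budget for a 99.95% monthly uptime commitment.
monthly_minutes = allowed_downtime(99.95, MINUTES_PER_30_DAY_MONTH)
print(f"99.95% monthly: {monthly_minutes:.1f} minutes per 30-day month")
```

For 99.9% this gives the 8.76 hours per year mentioned above, and for 99.95% it gives about 21.6 minutes per 30-day month, which rounds to the 22 minutes quoted for cloud virtual machines.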
Does the number of 9s matter?
An outage is an outage irrespective of what the SLA states the availability should be. It’s the impact of that outage that really matters.
Defining availability as a percentage or number of nines, as described above, is fundamentally flawed, as it assumes that all time has equal value to an organization, which is simply not the case. Take a retail chain as an example. If a 5-minute server outage occurs while stores are closed or during a quiet period, the impact is low: few customers are affected and little revenue is lost. However, if the same outage occurs during a peak period, or during a promotion or sale such as “Black Friday”, the impact is much greater and more far-reaching: unhappy customers, lost revenue (abandoned shopping carts) and damage to reputation, which in some cases can affect future revenue or the stock price.
Defining availability in this way takes into account only the length of the outage and does not address the overall impact on the business, which should also be factored into any SLA.
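To make the point concrete, the Python sketch below compares two outages of identical length under the retail example. It is only an illustration: the revenue-per-minute figures are invented, and weighting downtime by lost revenue is one assumed way of expressing business impact, not a standard SLA metric.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    label: str
    minutes: float
    revenue_per_minute: float  # hypothetical value of the service when the outage hit


def downtime_minutes(outages: list[Outage]) -> float:
    """Total outage length: all that a percentage-based SLA sees."""
    return sum(o.minutes for o in outages)


def business_impact(outages: list[Outage]) -> float:
    """Outage length weighted by the (assumed) revenue at risk at the time."""
    return sum(o.minutes * o.revenue_per_minute for o in outages)


# Two 5-minute outages with made-up revenue rates.
overnight = Outage("store-closed hours", minutes=5, revenue_per_minute=20.0)
black_friday = Outage("Black Friday peak", minutes=5, revenue_per_minute=4000.0)

for outage in (overnight, black_friday):
    print(f"{outage.label}: {downtime_minutes([outage])} minutes down, "
          f"estimated impact {business_impact([outage]):,.0f}")
```

Both outages consume exactly the same share of the availability budget, yet under these assumed rates the peak-period outage costs 200 times as much.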
In addition to the availability percentage, two other metrics are commonly associated with availability, or more precisely with recoverability from an outage: the recovery time objective (RTO) and the recovery point objective (RPO). The RTO is the maximum acceptable time to restore the service after an outage; the RPO is the maximum amount of data loss that can be tolerated, usually expressed as a period of time before the outage. Short RTOs, or RPOs that permit minimal or zero data loss, typically require expensive or specialist IT solutions. Like the availability percentage, neither the RTO nor the RPO addresses the impact on the business.
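As a rough illustration of how the two objectives are applied, the Python sketch below checks a single outage against an RTO and an RPO. The timestamps, the one-hour objectives and the function name are all hypothetical, and the data-loss window is assumed to be the time between the last good backup or replica and the failure.

```python
from datetime import datetime, timedelta


def meets_objectives(
    last_backup: datetime,
    failure: datetime,
    restored: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> tuple[bool, bool]:
    """Return (rto_met, rpo_met) for one outage."""
    recovery_time = restored - failure        # how long the service was down
    data_loss_window = failure - last_backup  # data written in this window is lost
    return recovery_time <= rto, data_loss_window <= rpo


# Hypothetical outage: backup at 01:00, failure at 01:45, service back at 03:15.
rto_met, rpo_met = meets_objectives(
    last_backup=datetime(2024, 1, 1, 1, 0),
    failure=datetime(2024, 1, 1, 1, 45),
    restored=datetime(2024, 1, 1, 3, 15),
    rto=timedelta(hours=1),
    rpo=timedelta(hours=1),
)
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

In this example 45 minutes of data are lost, which is within the one-hour RPO, but the 90-minute recovery misses the one-hour RTO. Neither result says anything about whether the outage fell in a quiet period or a peak one.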
Deep dive continued in part 2
Join us in Part 2 of our deep dive as we examine the impact of cloud on availability and more.
If you’d like to find out more about the cloud in the meantime, check out our article: Leveraging the Internet of Things? The cloud is not your holy grail