When I work with customers to design new applications, I come across a very common story that, while understandable, leaves organisations open to significant operational risk and reputational damage.
Below, I summarise the sort of risky situation I see and suggest ways to avoid similar mistakes. The problem most likely stems from the continued abstraction from the hardware layer that cloud computing brings. This provides great benefits, such as reduced complexity and lower operating costs. But while the look and feel is of something no longer akin to ‘traditional’ IT, the hardware used still has the same moving pieces and is subject to the same failed parts we’re used to.
Understand your SLAs
The cloud is still susceptible to failure unless you take the right precautions. And no matter how many spare parts you have and how much maintenance is undertaken, there is little that can be done in the face of a natural disaster. Preparing for this eventuality, however, will help.
Let’s take an Azure component that, out of the box, gives a 99.99 per cent financially backed availability SLA, and inbuilt backup features giving you a point-in-time restore every 10 minutes for the last 35 days. This means there is sufficient redundancy and self-healing in the underlying infrastructure running the SQL resource that Microsoft does not expect it to be down for more than 4.38 minutes in a given month. Microsoft is so confident of this, it has put its money where its mouth is and will return service credits to customers if these SLAs are breached.
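The 4.38-minute figure follows directly from the SLA percentage. As a quick illustration (the arithmetic below uses an average month length; Microsoft’s own rounding may differ slightly):

```python
# Convert an availability SLA percentage into the downtime it permits
# per month. Illustrative arithmetic only, using an average month of
# 30.4375 days (365.25 / 12).

def allowed_downtime_minutes(sla_percent: float,
                             days_in_month: float = 30.4375) -> float:
    """Maximum downtime (in minutes) per month allowed by an SLA."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}%  ->  {allowed_downtime_minutes(sla):.2f} min/month")
```

For 99.99 per cent this comes out at roughly 4.38 minutes a month; note how each extra “nine” cuts the allowance by an order of magnitude.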
So, let’s look at the risky situation I mentioned. A customer has their new app developed and ready to go. They need some infrastructure – let’s say a web app and some Azure SQL. They’ve selected their components, provided some application-specific information and deployed the resources. In a matter of minutes, their resources were available and their application code was deployed. So far, so good. They test the app and users are happy – it’s fast and responsive. The business is happy – it’s meeting their requirements and costing them less money than it did before. The project has been a success. So, what’s the problem?
Know your risk
What would the customer say if asked: “What recovery time objective (RTO) and recovery point objective (RPO) does the business have?” They could reply: “They have requested one hour, but we have 99.99% uptime, so it would only be down for a few minutes a month. The business is very happy.” To say there’s a 99.99 per cent uptime SLA is correct, but this only applies within a single Azure region. If that region were to suffer an issue – be it human, technical, mechanical or environmental – that stopped services being delivered from it, your application would be down, and likely for a lot longer than five minutes!
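The gap between the two conversations can be made concrete. A minimal sketch of the check the customer never ran – the outage-duration figure here is a hypothetical planning assumption for a regional failure, not an Azure guarantee:

```python
# Compare the business's stated RTO against a planning assumption for
# how long a regional outage could last with no cross-region DR in place.
# Both figures below are illustrative, not Azure commitments.

BUSINESS_RTO_MINUTES = 60                  # the one-hour objective from the example
ASSUMED_REGIONAL_OUTAGE_MINUTES = 8 * 60   # hypothetical: hours, not minutes

def rto_met(rto_minutes: int, worst_case_outage_minutes: int) -> bool:
    """True only if the worst credible outage still fits inside the RTO."""
    return worst_case_outage_minutes <= rto_minutes

if not rto_met(BUSINESS_RTO_MINUTES, ASSUMED_REGIONAL_OUTAGE_MINUTES):
    print("Single-region design cannot meet the one-hour RTO; "
          "a multi-region DR strategy is needed.")
```

The point is not the code itself but the comparison it forces: the SLA bounds routine downtime within a region, while the RTO must also cover the rarer, far longer regional failure.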
Many people are simply not aware that they need to consider designing across multiple Azure regions to mitigate this risk of downtime.
Empower your business
Now, these outages don’t happen often – over the years there have only been a handful of them – but they do happen. This means, if you’re involved in agreeing disaster recovery (DR) objectives in a business, it’s important to ensure you address the risk of a regional failure. You can then look towards the mitigation steps necessary. This allows the business to make an informed decision as to the ultimate solution design. If you don’t have DR as part of your design considerations in this way, the business is at risk of downtime without even knowing it.
Some DR measures may be cost prohibitive, but at least you can empower the business to understand the risk of not having them and weigh up the costs accordingly. If you have any immediate concerns regarding your new or existing Azure application deployments, or are considering Azure for new projects, it’s worth reviewing your plans for DR.
If you need support understanding your options, please reach out to Gareth at Gareth.Monk@rackspace.co.uk