Managing SLAs for Azure Virtual Machines

At Realdolmen we have a lot of customers using Azure Virtual Machines. A misconception we often see concerns the Service Level Agreement (SLA) covering the availability of these machines, or of the application installed on them.

Microsoft Azure has great features to ensure you have an SLA that suits your business needs. But you will need to make your application highly available yourself. With Virtual Machines, we have three options:

  • Availability Sets
  • Availability Zones (still in preview for now)
  • Premium Disks

Option A: Availability Sets

Starting with an Availability Set, you will need to deploy two or more VMs and load balance the application on top of these VMs. Within an Availability Set, your VMs are spread across fault domains and update domains.

A fault domain is a group of virtual machines that share a common power source and network switch. Spreading your VMs across fault domains limits the impact of potential hardware, network or power failures. Microsoft also needs to update and upgrade its own backbone, which is why we have update domains. When the underlying hardware needs a reboot for maintenance purposes, Microsoft will do so in phases, starting with the first update domain. When the maintenance of this first update domain has finished, the second one will start after 30 minutes, and so on.
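
To make the idea of fault and update domains a bit more tangible, here is a minimal sketch. It assumes a simple round-robin assignment and made-up domain counts (2 fault domains, 5 update domains); the function names and VM names are purely illustrative, and this is not necessarily how Azure places VMs internally.

```python
from collections import defaultdict


def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair, round-robin.

    The domain counts are assumptions chosen for illustration only.
    """
    return {
        name: (index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    }


def maintenance_waves(placement):
    """Group VMs by update domain, in the order the domains are rebooted."""
    waves = defaultdict(list)
    for name, (_fault_domain, update_domain) in placement.items():
        waves[update_domain].append(name)
    return [waves[update_domain] for update_domain in sorted(waves)]


placement = place_vms(["web-0", "web-1", "web-2"])
waves = maintenance_waves(placement)
all_vms = set(placement)

# During each maintenance wave, the VMs in the other update domains keep serving.
for wave in waves:
    still_up = sorted(all_vms - set(wave))
    print(f"rebooting {wave}, still serving: {still_up}")
```

With three VMs, each one lands in its own update domain, so every maintenance phase takes down only one VM while the load balancer keeps sending traffic to the other two.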

When you have deployed two or more VMs in the same Availability Set, Microsoft guarantees connectivity to at least one VM 99.95% of the time. So what does this actually mean? In an average month (about 30.4 days), you will have no more than 21 minutes 55 seconds of downtime. Which is quite nice, no?
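
A downtime budget like this is simple arithmetic: take the fraction of time the SLA does not cover and multiply it by the length of the period. Below is a minimal sketch, assuming an average month of 365.25 / 12 days; the function names are my own and not part of any Azure tooling.

```python
from datetime import timedelta

# Assumption: an "average" month is one twelfth of a 365.25-day year.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def downtime_budget(sla_percent, period=AVERAGE_MONTH):
    """Return the maximum downtime an SLA allows over the given period."""
    allowed_fraction = 1 - sla_percent / 100
    return period * allowed_fraction


def format_budget(budget):
    """Format a timedelta as whole minutes and seconds."""
    minutes, seconds = divmod(round(budget.total_seconds()), 60)
    return f"{minutes} min {seconds:02d} s"


print(format_budget(downtime_budget(99.95)))                      # 21 min 55 s
print(format_budget(downtime_budget(99.95, timedelta(days=31))))  # 22 min 19 s
```

Note that the result depends on the month length you pick: the same 99.95% allows about 22 minutes 19 seconds in a full 31-day month.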

Option B: Availability Zones

A few weeks ago, Microsoft announced something new, called Availability Zones. Azure datacenters are spread all over the globe, with 36 regions available and 6 more to come (as of today!). A region consists of a few zones, each backed by its own physical datacenters.

For example, the region West Europe (which we use the most, because it’s geographically the best option for us) has three zones. Each of these three has its own power, network and cooling, and of course compute, storage and all that other interesting stuff you need in a datacenter.

By now, you can probably guess how this works. When deploying Virtual Machines, you will be able to choose to put them in different zones. Doing this, they are physically and logically separated from each other. Of course, don’t forget, you will still need to load balance your incoming traffic!

The great benefit of this new feature is that you don’t have to think about picking the right fault or update domain for each VM. Just make sure you create at least two instances of each role and place them in different zones. Doing this, you will achieve a 99.99% monthly SLA. This means no more than 4 minutes 23 seconds of downtime in an average month!

One small side note for this feature: it’s currently in public preview, which means you shouldn’t use it for production purposes! While in public preview, it’s only available in West Europe and East US 2, and you will not be able to use it with all VM sizes. Right now, you are able to use Av2, Dv2 and DSv2. When this feature becomes Generally Available, you will be able to use every VM size available to you.
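
The rule "at least two instances of each role, in different zones" is easy to check against a deployment plan. Here is a small sketch, assuming the plan is just a list of (role, zone) pairs; the helper name and the sample roles are hypothetical.

```python
from collections import defaultdict


def roles_missing_zone_redundancy(instances):
    """Return the roles that are not spread over at least two distinct zones.

    `instances` is an iterable of (role, zone) pairs.
    """
    zones_by_role = defaultdict(set)
    for role, zone in instances:
        zones_by_role[role].add(zone)
    return sorted(role for role, zones in zones_by_role.items() if len(zones) < 2)


deployment = [
    ("web", 1), ("web", 2),  # spread over two zones
    ("api", 1), ("api", 1),  # two instances, but both in zone 1
]

print(roles_missing_zone_redundancy(deployment))  # ['api']
```

Counting instances alone is not enough: the "api" role has two VMs, but because they share a zone, a single zone outage would still take the role down.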

Option C: Premium Disks

The last option is also the one with the smallest impact on the design of your application. Let’s say your business needs an SLA, but you don’t really have the budget to make your application highly available with two or more VMs. Well, there is an option for this problem too, and it’s fairly simple: you just have to make sure your VM uses only premium disks. After deploying your VM with only premium disks, you will receive a 99.9% monthly SLA.

Of course, take into account that you will still have a single point of failure… but in terms of downtime, this is about 43 minutes 50 seconds in an average month. Which is quite OK for such a small change.

Wrapping up

So now you know how to get the right SLA for your application. These are the options available today, and I can only imagine the list of options will get bigger and better.

You may be wondering which option is the ‘best’. Well, there is no single answer to that question. “It depends” is my answer. It’s all about budget and SLA requirements; these are the two big factors in picking an option from the list above.

If you have any questions related to this topic, please do not hesitate to contact me. I’m more than happy to help and discuss this with you!



