VMware vROps - To HA or not to HA?

By vbulosity - January 21, 2016

A lot of my clients want to deploy vROps with the High Availability (HA) option enabled. When I hear this, I simply ask them to name at least one business requirement that dictates the use of HA for vROps in their environment. The answer I typically get is something like this: "because we want HA" with some undertones of: how dare you ask, you should know we're a very important company! Well, working as a vROps consultant over the last year, I realized that there are a lot of misconceptions about the vROps HA feature. In this post, I would like to clarify some things about vROps HA and hopefully empower you to make the right decision when designing your vSphere monitoring solution.

With the release of vROps 6.0, VMware introduced HA and all of a sudden everyone wanted it without first trying to understand what vROps HA is. I admit that after the initial product release there was a lot of confusion, but now a year later, after the dust has settled, let's get some things straight.

First things first: what is vROps HA? Let's get down to it and understand the benefits that HA provides. As the name implies, "High Availability" provides some level of protection against failure. At the most basic level, HA allows you to access and use the vROps solution while one of the nodes is down for whatever reason. During an HA event, you should be able to log in to the Product UI, run reports, and view dashboards as if the application were fully available. Data collection will also continue uninterrupted. So what types of failure does HA protect against?

ESXi host failure.
ESXi data/VM network connectivity loss.
vROps service failure.
Storage Problems.

I would argue that using vSphere HA to protect against the first two may be a better solution, but that depends on your Recovery Time Objective (RTO). Also, when you have a multi-node cluster without HA, you are still protected against most service issues because the same services run on all nodes. The only exception is the master node xDB functionality that stores the configuration metadata. That leaves protection against storage issues as the only truly compelling reason to use HA. So in my humble opinion, you don't need vROps HA in most cases, and here's why:

Is vROps going to be classified as a Business Critical system? Meaning that if it goes down, will your company lose revenue and/or customers? If no, then you don't need vROps HA because vROps availability does not impact your bottom line.
You say it does not matter and that you have a policy which requires HA for everything. OK then. What are your Recovery Time Objective and Recovery Point Objective (RPO) for non-Business Critical systems? If they are anything higher than 5 minutes, then you don't need vROps HA because vSphere HA and a proper backup solution will take care of most issues.
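
The two questions above can be written down as a small decision helper. This is only a sketch of the reasoning in this post, not an official sizing or design tool: the `Requirements` class and the function name are made up for illustration, and treating the tighter of RTO and RPO as the deciding value is my own reading of the 5-minute rule.

```python
from dataclasses import dataclass

# Threshold quoted in the post: above 5 minutes, vSphere HA plus backups
# are expected to cover most issues.
THRESHOLD_MINUTES = 5


@dataclass(frozen=True)
class Requirements:
    """Hypothetical container for the answers to the two questions."""
    business_critical: bool
    rto_minutes: float
    rpo_minutes: float


def vrops_ha_worth_considering(req: Requirements) -> bool:
    """Return True when vROps HA deserves a closer look, False otherwise."""
    if req.business_critical:
        return True
    # Assumption: the tighter of the two objectives drives the decision.
    return min(req.rto_minutes, req.rpo_minutes) <= THRESHOLD_MINUTES


if __name__ == "__main__":
    typical = Requirements(business_critical=False, rto_minutes=60, rpo_minutes=240)
    print(vrops_ha_worth_considering(typical))
```

A result of True only means the conversation about HA is worth having; it says nothing yet about the cost side discussed next.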

Do you see where I'm going with this yet? Yes, I'm trying to talk you out of using vROps HA because like I said, in most cases it's just "nice to have" but not necessary from the business point-of-view. And since IT's existence is more and more tied to Business Outcomes, justifying the added resource cost and complexity for questionable business value is very hard.

Now, don't get me wrong, there are still perfectly good business reasons to enable vROps HA, so let's talk about what you're getting yourself into:

As I hinted before, there is increased resource cost. When you enable vROps HA, it requires double the compute and storage resources to support the same number of collected objects and metrics you designed your environment for. For example, let's say your sizing calculations called for 4 medium nodes. If you enable HA, you will need 8 medium nodes. Remember, all nodes must be equal in size; you can't mix and match. You will need the additional compute power to keep a duplicate copy of the metric data in sync across the nodes, and obviously you need the additional storage to hold that duplicate data. Can your organization handle the increased resource cost for something that's just nice to have? Continuing with our 4 medium node example consuming 32 vCPUs, 128GB of memory, and 4TB of storage, doubling that would put us at 64 vCPUs, 256GB of memory, and 8TB of storage. Like I said, the costs can add up quickly.
Another consideration is physical server availability in your vSphere Cluster hosting the vROps solution. Do you have enough ESXi hosts for a one-to-one vROps node-to-ESXi host relationship? Why? It would be bad design if you had more than one vROps node per ESXi host. Remember, vROps HA protects against single node failure only. If you lose a host with 2 vROps nodes on it, the solution will experience a critical failure. This may also lead to data corruption and you may have to recover from backup. So when using vROps HA, host and datastore anti-affinity rules are a must. If you have to buy more physical hosts and storage to accommodate vROps HA, then that's increased cost and wait time.
You had a vROps HA event. Now what? Have you considered what to do in that case? Typically, you will have a vROps HA event for one of two reasons: A) ESXi host failure, or B) storage issue. In the first case, it's pretty straightforward: vSphere HA will kick in and start your node on another host. You log into the Admin UI and reestablish HA, easy. Let's consider the second scenario now: you lost a Datastore, someone deleted a LUN, and one of your vROps nodes is gone. Now what? Well, you will either have to restore the entire vROps cluster from backup, or evict the damaged node from the cluster, deploy a new one and add it back. If you restored, then you're back in business. However, if you deployed a new node, then you have to reestablish HA and wait for hours until all data gets replicated to the new node before HA is fully functional again. What!?! You thought it was automagic? You set it and forget it, and vROps takes care of itself? No way! Just like with hard disk RAID systems, someone has to take action to replace the failed disk and initiate a rebuild. This is very similar. This is why using vROps HA for protecting against ESXi host failure has very little to no benefit. It could take longer to restore the vROps HA cluster than just relying on plain old vSphere HA. In the second case, it boils down to your RTO and RPO, and how long of a vROps outage you can afford. Do you have a plan to deal with this?
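
To put numbers on the resource cost, here is a minimal sketch of the sizing arithmetic from the 4 medium node example. The per-node figures (8 vCPUs, 32GB, 1TB) are simply the post's totals divided by four, and the `NodeSize` class and `cluster_footprint` function are names I made up for illustration; use the official sizing guidelines for real designs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    vcpus: int
    memory_gb: int
    storage_tb: float


# Per-node figures derived from the post's 4-node totals (32 vCPUs, 128GB, 4TB).
MEDIUM = NodeSize(vcpus=8, memory_gb=32, storage_tb=1.0)


def cluster_footprint(sized_nodes: int, size: NodeSize, ha_enabled: bool) -> dict:
    """Total resources for a cluster; enabling HA doubles the node count."""
    nodes = sized_nodes * 2 if ha_enabled else sized_nodes
    return {
        "nodes": nodes,
        "vcpus": nodes * size.vcpus,
        "memory_gb": nodes * size.memory_gb,
        "storage_tb": nodes * size.storage_tb,
        # One vROps node per ESXi host, as recommended for HA designs.
        "hosts_for_one_node_per_host": nodes,
    }


if __name__ == "__main__":
    print(cluster_footprint(4, MEDIUM, ha_enabled=False))
    print(cluster_footprint(4, MEDIUM, ha_enabled=True))
```

Running it for 4 medium nodes with HA enabled gives 8 nodes, 64 vCPUs, 256GB of memory, 8TB of storage, and 8 ESXi hosts if you keep one node per host.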

In conclusion, I think that vROps HA is an amazing technology and works great when truly needed and properly designed, configured and maintained. I've seen it do its job and was impressed with how well it worked while meeting specific customer's business objectives. However, as you may be realizing by now, vROps HA introduces additional complexity and costs. Is this complexity worth the additional cost, time and effort if it's just nice to have? Maybe money is no object and you have time to tinker, but is it the best use of your time and your company's resources? Sometimes keeping things simple is the better design. Working in IT over the years, I've seen my share of over-engineered systems that become alligators you have to constantly feed or they get really hungry and end up eating you ;-)

There is one final note I would like to make about vROps HA. For some reason, when people hear vROps HA, they think they can stretch a vROps cluster between two geographically separate datacenters, like a vSphere Cluster. Although possible, this should never be done, period! Why? Because the product was not designed to do that. The underlying in-memory database technology that provides HA, GemFire, is very latency sensitive and will fail if you spread the cluster nodes among datacenters. Then you say, "well, we have dark fiber and latency is not an issue for us." Still, don't do it! It will work for a while and then fail, and you will be spending long hours on the phone with VMware Support trying to undo the mess you created. Trust me on this one. I've seen this, so don't try it at home. Besides latency, another common problem that stretched vROps HA clusters run into is an isolation event when the link between the two sites goes down. In such cases, HA does not work because the cluster just lost half of its nodes, and that's a critical failure which HA does not protect against. For example, let's say we have a 4 node HA cluster with 2 nodes (master and data node) at datacenter A and two other nodes (master replica and another data node) at datacenter B. Since HA protects against single node failure, when we lose communication to the other site with 2 nodes, that's more than HA can handle, so things collapse. The cluster dies and HA can't do anything about it. The same thing applies no matter how many nodes you have and how they are distributed between the sites. This is why vROps HA is not a disaster recovery (DR) solution: it protects against single node failure only.
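
The 4 node stretched cluster example can be sketched in a few lines. This is a toy model of the rule described above (HA tolerates the loss of one node), not a simulation of how vROps or GemFire actually behave; the node names and the `survives` function are made up for illustration.

```python
# Rule from the post: vROps HA protects against single node failure only.
TOLERATED_NODE_LOSSES = 1

# Hypothetical node placement matching the example in the text.
SITE_A = ["master", "data-node-1"]
SITE_B = ["master-replica", "data-node-2"]


def survives(nodes_lost: int) -> bool:
    """True if the HA cluster can ride through losing this many nodes."""
    return nodes_lost <= TOLERATED_NODE_LOSSES


if __name__ == "__main__":
    # A single node failure inside one site is what HA is built for.
    print("one node lost:", survives(1))
    # When the inter-site link drops, every node at the other site
    # becomes unreachable at the same moment.
    print("site link lost:", survives(len(SITE_B)))
```

The second check fails because two nodes disappear at once, which is exactly the isolation scenario described above.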

I hope this vROps HA reality check will help you make a better decision, regardless of your final choice. Just don't get carried away, and keep things simple.

For more information about vROps, see the following resources:

Books:

VMware vRealize Operations Managers Essentials by Matthew Steiner

Mastering vRealize Operations Manager by Scott Norris
VMware vRealize Operations Manager Capacity and Performance Management by Iwan 'e1' Rahabok

Official VMware:
VMware Professional Services
Official vROps Documentation
VMware Operations Management White Papers
Extensibility and Management Packs
vROps product page

Blogs:
vXpress by @Sunny_Dua
virtual red dot by @e1_ang
Virtualise Me by @auScottNorris
Elastic Sky Labs by @JAGaudreau
i'm all vIRTUAL by @LiorKamrat

Comments

Unknown - August 2, 2016 at 8:26 AM
Excellent article!! Had everything on one page that I was looking for. Single node per DC it is..

Unknown - September 4, 2016 at 8:37 AM
Great points. It comes down to a client-by-client choice. Everywhere I put vROps, the ROI and viability of the product have overshadowed the extra disk and compute.

@dcd270
But with more objects than ever, it's something everyone should at least consider.

Unknown - October 11, 2016 at 11:45 PM
Excellent article !!!

Unknown - January 20, 2017 at 10:32 AM
Great article, but I still have a doubt about a specific scenario.

Suppose we have a vROps cluster composed of 3 nodes:

- 1 master node
- 1 master replica
- 1 data node

In this situation, if the master node goes down, the master replica will take its place. But what is going to happen if the data node becomes unresponsive? Does the master or the master replica take over the role of the data node without losing the metrics collected by the unresponsive data node?

Would it be better to have 4 nodes, so 1 master, 1 replica and 2 data nodes?

thanks,

Fausto

SR - May 5, 2017 at 9:49 AM
Great article! In vROps 6.x, to remove HA you just need to disable the feature, and vROps converts the replica node back to a data node and restarts the cluster.
Do you know what happens to the replicated objects and metrics? Are they just deleted from the (former) replica nodes, or do we have to wait until the data is deleted according to the data retention period parameter?

Thanks!
Santiago.-
