Thursday, January 21, 2016

VMware vROps - To HA or not to HA?

A lot of my clients want to deploy vROps with High Availability (HA) option enabled. When I hear this, I simply ask them to name at least one business requirement that dictates the use of HA for vROps in their environment. The answer I typically get is something like this: "because we want HA" with some undertones of: how dare you ask, you should know we're a very important company! Well, working as a vROps consultant over the last year, I realized that there are a lot of misconceptions about the vROps HA feature. In this post, I would like to clarify some things about vROps HA and hopefully empower you to make the right decision when designing your vSphere monitoring solution.

With the release of vROps 6.0, VMware introduced HA and all of a sudden everyone wanted it without first trying to understand what vROps HA is. I admit that after the initial product release there was a lot of confusion, but now a year later, after the dust has settled, let's get some things straight.

First things first, so what is vROps HA? Let's get down to it and understand the benefits that HA provides. As the name implies, "High Availability" provides some level of protection against failure. At the most basic level, HA allows you to access and use the vROps solution while one of the nodes is down for whatever reason. During an HA event, you should be able to log in to the Product UI, run reports, and view dashboards as if the application were fully available. Data collection will also continue uninterrupted. So what types of failure does HA protect against?
  1. ESXi host failure.
  2. ESXi data/VM network connectivity loss.
  3. vROps service failure.
  4. Storage problems.
I would argue that using vSphere HA to protect against the first two may be a better solution, but that depends on your Recovery Time Objective (RTO). Also, it seems that when you have a multi-node cluster without HA, you are still protected against most service issues because the same services run on all nodes. The only exception is the master node xDB functionality that stores our configuration metadata. So it seems that protecting against storage issues is the only truly compelling reason to use HA. So in my humble opinion, you don't need vROps HA in most cases, and here's why:
  1. Is vROps going to be classified as a Business Critical system? Meaning that if it goes down, will your company lose revenue and/or customers? If not, then you don't need vROps HA because vROps availability does not impact your bottom line.
  2. You say it does not matter and that you have a policy which requires HA for everything. OK then. What are your RTO and Recovery Point Objective (RPO) for non-Business Critical systems? If either is anything higher than 5 minutes, then you don't need vROps HA because vSphere HA and a proper backup solution will take care of most issues.
Do you see where I'm going with this yet? Yes, I'm trying to talk you out of using vROps HA because like I said, in most cases it's just "nice to have" but not necessary from the business point-of-view. And since IT's existence is more and more tied to Business Outcomes, justifying the added resource cost and complexity for questionable business value is very hard.

Now, don't get me wrong, there are still perfectly good business reasons to enable vROps HA, so let's talk about what you're getting yourself into:
  1. As I hinted before, there is an increased resource cost. When you enable vROps HA, it requires double the compute and storage resources to support the same number of collected objects and metrics you designed your environment for. For example, let's say your sizing calculations called for 4 medium nodes. If you enable HA, you will need 8 medium nodes. Remember, all nodes must be equal in size; you can't mix and match. You need the additional compute power to replicate the metric data across the nodes, and obviously the additional storage to hold the duplicate copies. Can your organization handle the increased resource cost for something that's just nice to have? Continuing with our 4-medium-node example consuming 32 vCPUs, 128GB of memory, and 4TB of storage, doubling that would put us at 64 vCPUs, 256GB of memory, and 8TB of storage. Like I said, the costs can add up quickly.
  2. Another consideration is physical server availability in the vSphere cluster hosting the vROps solution. Do you have enough ESXi hosts for a one-to-one vROps-node-to-ESXi-host mapping? Why does this matter? Because running more than one vROps node per ESXi host is bad design. Remember, vROps HA protects against single node failure only. If you lose a host with 2 vROps nodes on it, the solution will experience a critical failure. This may also lead to data corruption, and you may have to recover from backup. So when using vROps HA, host and datastore anti-affinity rules are a must. If you have to buy more physical hosts and storage to accommodate vROps HA, then that's increased cost and wait time.
  3. You had a vROps HA event. Now what? Have you considered what to do when one occurs? Typically, you will have a vROps HA event for one of two reasons: A) ESXi host failure, or B) a storage issue. The first case is pretty straightforward: vSphere HA will kick in and start your node on another host. You log in to the Admin UI and reestablish HA. Easy. Now let's consider the second scenario: you lost a datastore, someone deleted a LUN, and one of your vROps nodes is gone. Now what? You will either have to restore the entire vROps cluster from backup, or evict the damaged node from the cluster, deploy a new one, and add it back. If you restored, then you're back in business. However, if you deployed a new node, then you have to reestablish HA and wait for hours until all data gets replicated to the new node before HA is fully functional again. What!?! You thought it was automagic? You set it and forget it, and vROps takes care of itself? No way! Just like with hard disk RAID systems, someone has to replace the failed disk and initiate a rebuild; this is very similar. This is why using vROps HA to protect against ESXi host failure has little to no benefit: it could take longer to restore the vROps HA cluster than to just rely on plain old vSphere HA. In the storage-failure case, it boils down to your RTO and RPO, and how long of a vROps outage you can afford. Do you have a plan to deal with this?
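
The resource math from point 1 can be sketched as a quick back-of-the-envelope calculation. The per-node figures (8 vCPU, 32GB RAM, 1TB disk for a medium node) are taken from the example above and are illustrative only, not official sizing guidance:

```python
# Back-of-the-envelope vROps cluster footprint, using the medium-node
# figures from the example above (8 vCPU, 32 GB RAM, 1 TB disk per node).
MEDIUM_NODE = {"vcpu": 8, "ram_gb": 32, "disk_tb": 1}

def cluster_footprint(node_count, ha_enabled, node=MEDIUM_NODE):
    """Total cluster resources; enabling HA doubles the node count."""
    effective_nodes = node_count * 2 if ha_enabled else node_count
    return {k: v * effective_nodes for k, v in node.items()}

print(cluster_footprint(4, ha_enabled=False))  # {'vcpu': 32, 'ram_gb': 128, 'disk_tb': 4}
print(cluster_footprint(4, ha_enabled=True))   # {'vcpu': 64, 'ram_gb': 256, 'disk_tb': 8}
```

Running the two calls reproduces the numbers from the example: 32 vCPUs / 128GB / 4TB without HA, doubling to 64 vCPUs / 256GB / 8TB with HA enabled.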
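
The anti-affinity requirement from point 2 boils down to a simple invariant: no ESXi host may carry more than one vROps node. A minimal sketch of that placement check (node and host names are hypothetical):

```python
from collections import Counter

def violates_anti_affinity(placement):
    """placement: dict mapping vROps node name -> ESXi host name.
    Returns the hosts carrying more than one vROps node, i.e. hosts
    where a single failure would take out 2+ nodes at once."""
    host_counts = Counter(placement.values())
    return [host for host, count in host_counts.items() if count > 1]

placement = {"node-1": "esxi-a", "node-2": "esxi-b",
             "node-3": "esxi-c", "node-4": "esxi-a"}  # bad: 2 nodes on esxi-a
print(violates_anti_affinity(placement))  # ['esxi-a']
```

In practice you would enforce this with DRS VM-VM anti-affinity rules rather than a script, but the invariant being enforced is exactly this one.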
In conclusion, I think that vROps HA is an amazing technology and works great when truly needed and properly designed, configured and maintained. I've seen it do its job and was impressed with how well it worked while meeting specific customer's business objectives. However, as you may be realizing by now, vROps HA introduces additional complexity and costs. Is this complexity worth the additional cost, time and effort if it's just nice to have? Maybe money is no object and you have time to tinker, but is it the best use of your time and your company's resources? Sometimes keeping things simple is the better design. Working in IT over the years, I've seen my share of over-engineered systems that become alligators you have to constantly feed or they get really hungry and end up eating you ;-)

There is one final note I would like to make about vROps HA. For some reason, when people hear vROps HA, they think they can stretch a vROps cluster between two geographically disparate datacenters, like a vSphere cluster. Although possible, this should never be done, period! Why? Because the product was not designed to do that. The underlying in-memory database technology that provides HA, GemFire, is very latency sensitive and will fail if you spread the cluster nodes among datacenters. Then you say, "well, we have dark fiber and latency is not an issue for us." Still, don't do it! It will work for a while, then fail, and you will be spending long hours on the phone with VMware Support trying to undo the mess you created. Trust me on this one. I've seen this, so don't try it at home. Besides latency, another common problem that stretched vROps HA clusters run into is an isolation event when the link between the two sites goes down. In such cases, HA does not work because the cluster just lost half of its nodes, and that's a critical failure which HA does not protect against. For example, let's say we have a 4-node HA cluster with 2 nodes (the master and a data node) at datacenter A and two other nodes (the master replica and another data node) at datacenter B. Since HA protects against single node failure, losing communication to the other site's 2 nodes is more than HA can handle, so things collapse. The cluster dies and HA can't do anything about it. The same thing applies no matter how many nodes you have and how they are distributed between the sites. This is why vROps HA is not a disaster recovery (DR) solution and protects against single node failure only.
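The site-isolation failure mode above follows from one rule: vROps HA tolerates the loss of exactly one node. A tiny sketch of why a two-site stretched layout can never satisfy that rule when the inter-site link drops:

```python
def ha_cluster_survives(failed_nodes):
    """vROps HA protects against single node failure only; losing
    two or more nodes simultaneously is a critical failure."""
    return failed_nodes <= 1

# Stretched 4-node cluster, 2 nodes per site: when the inter-site
# link drops, each side sees the other site's 2 nodes fail at once.
nodes_per_site = 2
print(ha_cluster_survives(1))               # True  - ordinary single-node loss
print(ha_cluster_survives(nodes_per_site))  # False - site isolation kills the cluster
```

However you split the nodes between two sites, at least one site holds 2+ nodes, so a link failure always exceeds the single-node tolerance.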

I hope this vROps HA reality check will help you make a better decision, regardless of your final choice. Just don't get carried away, and keep things simple.


  1. Excellent article!! Had everything on one page that I was looking for. Single node per DC it is.

  2. Great points. It comes down to a client-by-client choice. Everywhere I put vROps, the ROI and viability of the product have overshadowed the extra disk and compute.

    But with more objects than ever, it's something everyone should at least consider.

  3. Great article, but I still have a question about a specific scenario.

    Suppose we have a vROps cluster composed of 3 nodes:

    - 1 master node
    - 1 master replica
    - 1 data node

    In this situation, if the master node goes down, the master replica will take its place. But what is going to happen if the data node becomes unresponsive? Does the master or master replica take over the role of the data node without losing the metrics collected by the unresponsive data node?

    Would it be better to have 4 nodes, so 1 master, 1 replica, and 2 data nodes?



    1. All nodes in a vROps Analytics Cluster have performance metric data on them, including Master, Master Replica and Data Nodes. Only Remote Collectors do not store any data. Master and Master Replica have additional configuration metadata not applicable for this conversation.
      If you enable vROps HA and lose any node, including a Data Node, there will be no data loss; however, you may see performance degradation due to the lower number of nodes left online. HA protects against data loss and provides application access in the event of a single node loss only.
      For production, when enabling HA, I would not recommend anything less than 4 Medium (8x32) nodes. 2- and 3-node configurations are only good for testing in a lab. Good luck!

  4. Great article! In vROps 6.x, to remove HA you just need to disable the feature, and vROps converts the replica node back to a data node and restarts the cluster.
    Do you know what happens to the replicated objects and metrics? Are they just deleted from the (former) replica nodes? Or do we have to wait until the data is deleted according to the data retention period parameter?


    1. That's a very good question SR. I've never thought about it and actually don't know the answer. I'll pass it along to our product team and see if they can provide some guidance. In the meantime you could try an experiment and report back. Before you disable HA mode, monitor the DB Disk Space on all of your nodes and see if it goes down. As you know, in HA mode data is replicated among all of the nodes, so if you disable HA the duplicate data should go away, but I'm not 100% sure if there is a cleanup process that actually does this. I'm interested to find out myself!
      Other than that, the Master Replica node has some Analytics Cluster configuration data replicated from the Master node in a Cassandra database. This is a very insignificant amount of data, so it probably does not take up a whole lot of disk space if not cleaned up. Thanks,

    2. Hi Peter, I've done my tests in a 6.5 lab environment, and it seems that vROps does not delete the duplicate data once HA is disabled. That's sad, as AFAIK there is no procedure to clean up that data.